Build a Production Agent Loop with Ollama Tool Calling: Complete Guide
The bottom line: Ollama’s native tool calling (function calling) API lets you build a local agent loop that executes multi-step reasoning with tool use, parallel dispatch, and error recovery — no GPU cluster needed. This guide walks through building a production-grade agent loop with Qwen 3, covering the core loop, parallel tools, streaming, rate limiting, and deployment on modest hardware.
Why Local Tool Calling Matters
Running agent workloads locally isn’t just about cost savings or privacy. For many production patterns — CI/CD integrations, local file operations, development tooling — a remote API introduces latency, rate limits, and data exfiltration risk. Ollama’s tool calling API, available since v0.3.0, gives you the same function-calling primitive as OpenAI and Anthropic, but running on your own hardware [1].
The models that support it in 2026: Qwen 3 (all sizes), Llama 3.1/3.3, Gemma 4, Mistral Small 3.1, DeepSeek R1, and Hermes 3 [2]. Qwen 3 is the go-to for agent workloads — best tool selection accuracy per parameter, and its think=True mode reveals intermediate reasoning, which is invaluable for debugging agent behavior.
The Core Agent Loop
A production agent loop needs more than the minimal example from the docs. Here’s the pattern with error boundaries, loop limits, and structured tool dispatch:
import json
from typing import Callable, Optional

from ollama import chat, ChatResponse
from tenacity import retry, stop_after_attempt, wait_exponential

MAX_ITERATIONS = 15


class AgentLoop:
    def __init__(self, model: str = "qwen3", tools: Optional[dict[str, Callable]] = None):
        self.model = model
        self.tools = tools or {}  # maps tool name -> Python function
        self.tool_schemas = self._build_schemas()

    def _build_schemas(self) -> list[Callable]:
        # Tools are passed as Python functions; the Ollama SDK generates JSON schemas from them
        return list(self.tools.values())

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
    def _llm_call(self, messages: list) -> ChatResponse:
        return chat(
            model=self.model,
            messages=messages,
            tools=self.tool_schemas,
            options={"temperature": 0.1},  # low temperature for more repeatable tool selection
            think=True,  # surface reasoning for debugging
        )

    def run(self, user_input: str) -> str:
        messages = [{"role": "user", "content": user_input}]
        for iteration in range(MAX_ITERATIONS):
            response = self._llm_call(messages)
            messages.append(response.message)
            if not response.message.tool_calls:
                return response.message.content or ""  # no tool calls means the model is done
            for call in response.message.tool_calls:
                fn_name = call.function.name
                fn_args = call.function.arguments or {}
                try:
                    result = self.tools[fn_name](**fn_args)
                except Exception as e:
                    # Unknown tools and tool crashes are both reported back to the model
                    result = f"ERROR: {e}"
                messages.append({
                    "role": "tool",
                    "tool_name": fn_name,
                    "content": str(result),
                })
        return "Agent reached max iterations without final answer."
Key differences from the minimal example [1]:
- Retry wrapper: transient Ollama API errors (model loading, OOM) get exponential backoff
- Temperature 0.1: near-greedy sampling makes tool selection more repeatable and reduces hallucinated function names
- Max iterations: prevents infinite loops on confused models
- Error boundaries per tool: a crashing tool doesn’t kill the entire agent
- think=True: exposes model reasoning, critical for debugging tool selection
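The loop above hands plain Python functions to the SDK and relies on it to derive a JSON schema from each signature and docstring. As a rough illustration of what such a schema looks like, the sketch below builds one by hand with `inspect`. `function_to_schema`, its type mapping, and the `get_temperature` tool are hypothetical names written for this example; this is not the SDK's actual implementation.

```python
import inspect
from typing import Callable

# Hypothetical mapping from Python annotations to JSON Schema type names
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def function_to_schema(fn: Callable) -> dict:
    """Build a function-calling tool schema from a signature (illustrative only)."""
    sig = inspect.signature(fn)
    properties = {}
    required = []
    for name, param in sig.parameters.items():
        # Unannotated or unrecognized types fall back to "string" in this sketch
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def get_temperature(city: str, unit: str = "celsius") -> str:
    """Get the current temperature for a city."""
    return f"22 degrees {unit} in {city}"


schema = function_to_schema(get_temperature)
print(schema["function"]["parameters"])
```

Clear type hints and a one-line docstring are what give the model something to select on, so they are worth writing even for trivial tools.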
Parallel Tool Dispatch
When the model needs data from multiple sources — weather in three cities, file stats across directories — the agent loop shouldn’t wait for each tool sequentially. Ollama supports parallel tool calls: the model returns an array of tool_calls in a single response [1].
def run_parallel(self, user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    for iteration in range(MAX_ITERATIONS):
        response = self._llm_call(messages)
        messages.append(response.message)
        if not response.message.tool_calls:
            return response.message.content
        # Handle every tool call from this turn, then append all results together.
        # The calls still execute one after another in this version.
        tool_results = []
        for call in response.message.tool_calls:
            fn_name = call.function.name
            fn_args = call.function.arguments or {}
            try:
                result = self.tools[fn_name](**fn_args)
            except Exception as e:
                result = f"ERROR: {e}"
            tool_results.append({
                "role": "tool",
                "tool_name": fn_name,
                "content": str(result),
            })
        messages.extend(tool_results)
    return "Agent reached max iterations without final answer."
For truly parallel execution (I/O-bound tools like API calls), use concurrent.futures.ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor


def _dispatch_tools(self, calls: list) -> list[dict]:
    results = []
    with ThreadPoolExecutor(max_workers=5) as pool:
        # Submit everything first so the tools run concurrently
        futures = [(c, pool.submit(self._execute_tool, c)) for c in calls]
        # Collect in submission order so results line up with the model's calls,
        # which matters when the same tool is called several times in one turn
        for call, future in futures:
            try:
                result = future.result()
            except Exception as e:
                result = f"ERROR: {e}"
            results.append({
                "role": "tool",
                "tool_name": call.function.name,
                "content": str(result),
            })
    return results
The model handles parallel results naturally — it sees all tool outputs in the next turn and synthesizes them into a coherent answer. Testing with qwen3 shows it correctly correlates results from parallel calls with their original context [3].
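When one tool is called several times in the same turn (weather in three cities, say), `tool_name` alone does not tell the model which result belongs to which call, so it can help to return results in request order. The sketch below is a standalone illustration of that idea, using only the standard library. `slow_lookup` is a made-up stand-in for an I/O-bound tool, and the `(name, args)` tuples are a simplification of the SDK's tool-call objects.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_lookup(city: str) -> str:
    """Stand-in for an I/O-bound tool such as an HTTP request."""
    time.sleep(0.2)
    return f"{city}: ok"


def dispatch_in_order(calls: list, tools: dict) -> list:
    """Run every call concurrently, but return results in the order they were requested."""
    def run_one(call: tuple) -> str:
        name, args = call
        try:
            return str(tools[name](**args))
        except Exception as e:  # report the failure to the model instead of raising
            return f"ERROR: {e}"

    with ThreadPoolExecutor(max_workers=5) as pool:
        outputs = list(pool.map(run_one, calls))  # map preserves input order
    return [
        {"role": "tool", "tool_name": name, "content": out}
        for (name, _), out in zip(calls, outputs)
    ]


calls = [("slow_lookup", {"city": c}) for c in ("Oslo", "Lima", "Pune")]
start = time.perf_counter()
results = dispatch_in_order(calls, {"slow_lookup": slow_lookup})
elapsed = time.perf_counter() - start
print([r["content"] for r in results], f"{elapsed:.2f}s")
```

Threads only help for tools that wait on I/O. CPU-bound tools would need a process pool or a different design.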
Streaming: Real-Time Output for UX
For interactive applications, streaming gives the user incremental progress instead of a long wait:
def run_streaming(self, user_input: str):
    messages = [{"role": "user", "content": user_input}]
    for iteration in range(MAX_ITERATIONS):
        stream = chat(
            model=self.model,
            messages=messages,
            tools=self.tool_schemas,
            stream=True,
            options={"temperature": 0.1},
        )
        full_response = ""
        tool_calls = []
        for chunk in stream:
            if chunk.message and chunk.message.content:
                yield ("delta", chunk.message.content)
                full_response += chunk.message.content
            if chunk.message and chunk.message.tool_calls:
                # Tool calls can arrive in more than one chunk, so accumulate them
                tool_calls.extend(chunk.message.tool_calls)
        if not tool_calls:
            yield ("done", full_response)
            return
        yield ("tool_calls", [
            {"name": tc.function.name, "args": tc.function.arguments}
            for tc in tool_calls
        ])
        messages.append({
            "role": "assistant",
            "content": full_response,
            "tool_calls": [
                {"type": "function", "function": {
                    "name": tc.function.name,
                    # Keep arguments as a dict: the API expects an object, not a JSON string
                    "arguments": dict(tc.function.arguments or {}),
                }}
                for tc in tool_calls
            ],
        })
        # Execute and append tool results
        for tc in tool_calls:
            try:
                result = self.tools[tc.function.name](**(tc.function.arguments or {}))
            except Exception as e:
                result = f"ERROR: {e}"
            messages.append({"role": "tool", "tool_name": tc.function.name, "content": str(result)})
            yield ("tool_result", {"name": tc.function.name, "result": result})
    # Mirror run(): tell the consumer when the iteration limit is hit
    yield ("done", "Agent reached max iterations without final answer.")
Consumers can render this in a terminal with rich text or a web UI with server-sent events:
import sys

for event_type, data in agent.run_streaming("What's the disk usage on /data?"):
    if event_type == "delta":
        sys.stdout.write(data)
        sys.stdout.flush()
    elif event_type == "tool_calls":
        names = ", ".join(call["name"] for call in data)
        print(f"\n[Calling: {names}]\n")
    elif event_type == "tool_result":
        # Tools may return non-string values, so convert before truncating
        print(f"\n[Result: {str(data['result'])[:80]}...]\n")
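Because the streaming method yields plain `(event_type, data)` tuples, a consumer can be developed and tested without a model loaded. The sketch below assumes the event shapes shown above. `fake_events` and `render` are hypothetical helpers for this example, and the event payloads are invented sample data.

```python
from typing import Iterable, Iterator, Tuple


def fake_events() -> Iterator[Tuple[str, object]]:
    """Hypothetical event sequence shaped like the tuples the streaming loop yields."""
    yield ("tool_calls", [{"name": "disk_usage", "args": {"path": "/data"}}])
    yield ("tool_result", {"name": "disk_usage", "result": "Total: 100GB, Used: 40GB, Free: 60GB"})
    yield ("delta", "/data is ")
    yield ("delta", "40% full.")
    yield ("done", "/data is 40% full.")


def render(events: Iterable[Tuple[str, object]]) -> str:
    """Collect events into a transcript string instead of writing to stdout."""
    lines = []
    answer = []
    for event_type, data in events:
        if event_type == "delta":
            answer.append(data)
        elif event_type == "tool_calls":
            names = ", ".join(call["name"] for call in data)
            lines.append(f"[Calling: {names}]")
        elif event_type == "tool_result":
            lines.append(f"[Result: {str(data['result'])[:80]}]")
        # "done" carries the full text already assembled from the deltas
    lines.append("".join(answer))
    return "\n".join(lines)


print(render(fake_events()))
```

Returning a string rather than printing keeps the renderer easy to assert on in unit tests.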
Error Recovery Strategies
Models sometimes get tool calling wrong. Here’s how to handle the three most common failure modes:
1. Hallucinated function names
A model might call get_weather when only get_temperature exists:
def _execute_tool(self, call) -> str:
    fn_name = call.function.name
    if fn_name not in self.tools:
        # Tell the model what does exist so it can correct itself
        return f"ERROR: Unknown function '{fn_name}'. Available: {list(self.tools.keys())}"
    return str(self.tools[fn_name](**(call.function.arguments or {})))
Returning the correct function list lets the model recover on the next turn.
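A possible refinement is to point the model at the closest registered name, which may shorten recovery when the hallucinated name is a near miss. The sketch below uses the standard library's `difflib.get_close_matches`. `unknown_tool_message` is a hypothetical helper, and the tool names are borrowed from the file-analysis example later in this guide.

```python
import difflib


def unknown_tool_message(requested: str, available: list) -> str:
    """Build an error string that nudges the model toward the closest registered tool."""
    message = f"ERROR: Unknown function '{requested}'. Available: {available}"
    # n=1 keeps only the best candidate; names below the default similarity cutoff are ignored
    close = difflib.get_close_matches(requested, available, n=1)
    if close:
        message += f". Did you mean '{close[0]}'?"
    return message


available = ["list_files", "disk_usage", "search_files"]
print(unknown_tool_message("disk_usag", available))
print(unknown_tool_message("zzz", available))
```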
2. Missing required parameters
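import inspect
from typing import Callable


def safe_invoke(fn: Callable, args: dict) -> str:
    sig = inspect.signature(fn)
    # Required means: no default value, and not a *args / **kwargs catch-all
    required = {
        name for name, param in sig.parameters.items()
        if param.default is inspect.Parameter.empty
        and param.kind not in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    }
    missing = required - set(args.keys())
    if missing:
        return f"ERROR: Missing required parameters: {sorted(missing)}"
    return str(fn(**args))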
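Models can also invent parameters that a tool does not accept. One way to catch both missing and unexpected arguments in a single step is `inspect.Signature.bind`, which raises `TypeError` when the arguments do not fit the signature. This is a minimal sketch; `check_arguments` is a hypothetical helper, and the `search_files` stub exists only to demonstrate it.

```python
import inspect
from typing import Callable, Optional


def check_arguments(fn: Callable, args: dict) -> Optional[str]:
    """Return an error string if args do not fit fn's signature, else None."""
    try:
        # bind() applies the same matching rules as a real call, without calling fn
        inspect.signature(fn).bind(**args)
    except TypeError as e:
        return f"ERROR: Invalid arguments for {fn.__name__}: {e}"
    return None


def search_files(pattern: str, path: str = ".") -> str:
    """Stub tool used only for this demonstration."""
    return f"searching {path} for {pattern}"


print(check_arguments(search_files, {"pattern": "*.py"}))
print(check_arguments(search_files, {}))
print(check_arguments(search_files, {"pattern": "*.py", "depth": 2}))
```

Checking before the call keeps argument errors separate from exceptions raised inside the tool itself.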
3. Model stuck in tool-calling loop
Sometimes the model calls tools forever without generating a final answer. The max-iteration guard handles this, but you can also detect the pattern mid-loop:
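def _detect_loop(self, history: list) -> bool:
    """Detect repeated identical tool calls"""
    def calls_of(m):
        # History mixes plain dicts and SDK message objects
        return m.get("tool_calls") if isinstance(m, dict) else getattr(m, "tool_calls", None)

    # Take the last four assistant turns that requested tools. Slicing the raw
    # history first would rarely find four, because tool results sit between them.
    recent = [str(c) for c in map(calls_of, history) if c][-4:]
    if len(recent) == 4:
        return len(set(recent)) <= 2  # same tools repeated
    return False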
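Comparing calls by their string form can miss repeats when the same arguments arrive in a different key order. A possible alternative is to keep a separate log of calls and compare canonical signatures, sketched below. `call_signature`, `is_repeating`, the `window` and `threshold` defaults, and the `(name, args)` log format are all assumptions made for this example.

```python
import json
from collections import Counter


def call_signature(name: str, args: dict) -> str:
    """Canonical string for one tool call; sort_keys makes argument order irrelevant."""
    return json.dumps({"name": name, "args": args}, sort_keys=True, default=str)


def is_repeating(call_log: list, window: int = 4, threshold: int = 3) -> bool:
    """True if any single call appears at least `threshold` times in the last `window` calls."""
    recent = [call_signature(name, args) for name, args in call_log[-window:]]
    if not recent:
        return False
    return Counter(recent).most_common(1)[0][1] >= threshold


log = [
    ("disk_usage", {"path": "/data"}),
    ("disk_usage", {"path": "/data"}),
    ("list_files", {"path": "/data"}),
    ("disk_usage", {"path": "/data"}),
]
print(is_repeating(log))      # three of the last four calls are identical
print(is_repeating(log[:2]))  # only two calls so far
```

When the check fires, one option is to append a message asking the model to answer with what it already has, rather than aborting outright.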
Model Selection Guide
Not all models handle tool calling equally. From real-world testing and community reports [2][3]:
| Model | Params | VRAM | Tool Accuracy | Notes |
|---|---|---|---|---|
| Qwen 3 | 8B-235B | 6GB+ | Excellent | Best overall, strong parallel support |
| Llama 3.1 8B | 8B | 6GB | Good | Fast, reliable for simple tool sets |
| Gemma 4 | 9B-27B | 7GB+ | Very good | Strong reasoning with tool use |
| DeepSeek R1 14B | 14B | 10GB | Good | Slower but thorough reasoning |
| Hermes 3 70B | 70B | 40GB | Excellent | Best accuracy, needs high-end GPU |
For development on consumer hardware (16GB VRAM or less), start with Qwen 3 8B (5.5GB) or Qwen 3 14B (9GB). For production accuracy on sensitive workflows, Qwen 3 32B (20GB) offers the best accuracy-to-cost ratio [3].
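The memory figures quoted above are roughly what a back-of-the-envelope estimate gives for 4-bit quantized weights. The sketch below assumes an average of 4.5 bits per weight and a flat 1 GB of overhead for runtime buffers. Both numbers are assumptions for illustration, not measurements, and real usage grows with context length.

```python
def estimate_vram_gb(
    params_billion: float,
    bits_per_weight: float = 4.5,  # assumed average for 4-bit quantization
    overhead_gb: float = 1.0,      # assumed flat allowance for cache and buffers
) -> float:
    """Rough VRAM estimate: quantized weight size plus a fixed overhead."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


for size in (8, 14, 32):
    print(f"{size}B: ~{estimate_vram_gb(size):.1f} GB")
```

Treat the result as a lower bound when sizing hardware, and confirm with the actual memory reported once the model is loaded.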
Production Deployment
For a production agent service, run Ollama behind a simple queue to prevent VRAM contention:
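import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
agent = AgentLoop(model="qwen3:14b", tools=my_tools)  # my_tools: your dict of name -> function
gpu_slot = asyncio.Semaphore(1)  # crude queue: one non-streaming agent run at a time

class Query(BaseModel):
    input: str
    stream: bool = False

def sse_events(user_input: str):
    """Serialize each (event_type, data) tuple as a server-sent event."""
    for event_type, data in agent.run_streaming(user_input):
        yield f"data: {json.dumps({'type': event_type, 'data': data}, default=str)}\n\n"

@app.post("/agent")
async def run_agent(query: Query):
    if query.stream:
        return StreamingResponse(sse_events(query.input), media_type="text/event-stream")
    async with gpu_slot:
        result = await asyncio.to_thread(agent.run, query.input)
    return {"result": result}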
Critical deployment rules for Ollama in production:
- One model at a time per GPU: Ollama keeps the loaded model in VRAM. Concurrent requests to different models cause thrashing.
- Set OLLAMA_NUM_PARALLEL: controls concurrent request handling for the same model. Start at 1, increase to 2-4 for latency-tolerant workloads.
- Systemd keepalive: run ollama serve as a systemd service with Restart=always and a health check every 30 seconds.
- Health endpoint: http://localhost:11434/api/tags returns JSON. Monitor it in your load balancer.
# systemd unit snippet
[Service]
ExecStart=/usr/bin/ollama serve
Environment=OLLAMA_NUM_PARALLEL=2
Environment=OLLAMA_KEEP_ALIVE=24h
Restart=always
RestartSec=5
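The health check mentioned in the rules above can be a few lines of standard-library Python run from a timer or a monitoring agent. This is a minimal sketch that assumes the default port and treats any parseable JSON reply from /api/tags as healthy. `ollama_is_healthy` is a hypothetical helper name.

```python
import json
import urllib.error
import urllib.request


def ollama_is_healthy(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if GET /api/tags answers with parseable JSON within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed body
        return False


if __name__ == "__main__":
    print("healthy" if ollama_is_healthy() else "unreachable")
```

A reachable endpoint only shows that the server process is up. It does not prove the model is loaded or that generation works, so a periodic end-to-end prompt is a useful complement.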
Putting It All Together
Here’s a complete working example — a file analysis agent that reads directories, checks disk usage, and searches file contents. Pull qwen3:14b and run it:
import glob
import os
import shutil

from agent_loop import AgentLoop


def list_files(path: str = ".") -> str:
    """List files in a directory"""
    return "\n".join(os.listdir(path)[:30])  # cap output to keep the context small


def disk_usage(path: str = ".") -> str:
    """Check disk usage of a path"""
    total, used, free = shutil.disk_usage(path)
    return f"Total: {total//(2**30)}GB, Used: {used//(2**30)}GB, Free: {free//(2**30)}GB"


def search_files(pattern: str, path: str = ".") -> str:
    """Search for files matching a glob pattern"""
    # recursive=True is required for "**" patterns to descend into subdirectories
    return "\n".join(glob.glob(os.path.join(path, pattern), recursive=True)[:20])


tools = {
    "list_files": list_files,
    "disk_usage": disk_usage,
    "search_files": search_files,
}

agent = AgentLoop(model="qwen3:14b", tools=tools)
result = agent.run("Find all Python files in /data, then check disk usage")
print(result)
The agent will: call search_files("**/*.py", "/data") → get results → call disk_usage("/data") → synthesize both into a human-readable answer. All locally, no API keys needed.
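The dispatch logic can also be exercised without a model loaded by replaying canned replies. The sketch below is a simplified, dict-based variant of the loop with the model call injected as a parameter. `run_loop`, `scripted_llm`, and the reply format are hypothetical and written for this example; they are not part of the Ollama SDK.

```python
from typing import Callable


def run_loop(llm: Callable, tools: dict, user_input: str, max_iterations: int = 15) -> str:
    """Dict-based variant of the agent loop with the model call injected as `llm`."""
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_iterations):
        reply = llm(messages)
        messages.append(reply)
        if not reply.get("tool_calls"):
            return reply.get("content", "")
        for call in reply["tool_calls"]:
            name, args = call["name"], call.get("arguments", {})
            try:
                result = tools[name](**args)
            except Exception as e:
                result = f"ERROR: {e}"
            messages.append({"role": "tool", "tool_name": name, "content": str(result)})
    return "Agent reached max iterations without final answer."


def scripted_llm(script: list) -> Callable:
    """Return a fake model that replays canned replies, one per call."""
    replies = iter(script)
    return lambda messages: next(replies)


script = [
    {"role": "assistant", "tool_calls": [{"name": "disk_usage", "arguments": {"path": "/data"}}]},
    {"role": "assistant", "content": "Disk usage checked."},
]
seen = []  # records the arguments the fake tool was called with
tools = {"disk_usage": lambda path: seen.append(path) or "Used: 40GB"}
print(run_loop(scripted_llm(script), tools, "Check /data"))
```

Scripts like this make it cheap to cover the failure paths (unknown tool, bad arguments, endless tool calls) before running the same cases against a real model.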
Verdict
Ollama’s tool calling API in 2026 is production-ready for agent workloads that need local execution, data privacy, or low latency. Qwen 3 provides reliable tool selection at every model size tier. The patterns above — parallel dispatch, streaming, error recovery, and deployment hardening — turn the basic doc example into a real agent service.
The biggest remaining risk is model confusion on complex multi-tool workflows. Test your specific tool set exhaustively before deploying. For most single-domain agent tasks (file ops, code analysis, local API orchestration), the local loop avoids network round-trips and per-token API fees, and it keeps your data on your own hardware.
References
[1] Ollama Tool Calling Documentation — https://docs.ollama.com/capabilities/tool-calling
[2] “Best Ollama Local Models for Tool Calling 2026” — Clawdbook — https://clawdbook.org/blog/openclaw-best-ollama-models-2026
[3] “Ollama Tool Calling Guide: Build AI Agents with Local LLMs” — Local AI Master — https://localaimaster.com/blog/ollama-tool-calling-guide
[4] Qwen 3 Model Card — Ollama Library — https://ollama.com/library/qwen3
[5] Ollama GitHub Repository — Server Configuration — https://github.com/ollama/ollama