feat(metrics): implement Prometheus observability by MatteoMori · Pull Request #45 · kagent-dev/tools

MatteoMori · 2026-02-15T19:23:13Z

While working with the kAgent Tool MCP Server, I noticed the absence of basic Prometheus metrics. Personally, I would find it very useful to know things like: which tools am I exposing? Are invocations failing more than usual? With that information I could start building operational best practices around the tool.

So I spent a little bit of time adding some basic metrics in this project.

What does this PR add?

[x] Prometheus server: it can run on the MCP port or on a custom one
[x] 4 initial metrics:
- kagent_tools_mcp_server_info - Server metadata (version, commit, build date)
- kagent_tools_mcp_registered_tools - Gauge per tool (tool_name, tool_provider)
- kagent_tools_mcp_invocations_total - Counter of all invocations (DISCLAIMER: OPUS helped a lot here)
- kagent_tools_mcp_invocations_failure_total - Counter of failures (DISCLAIMER: OPUS helped a lot here)
[x] updated the Helm chart
[x] added a basic Grafana dashboard

Replace generateRuntimeMetrics() with prometheus/client_golang and add flexible metrics server architecture supporting same-port or dedicated port deployment.

Changes:
- Add internal/metrics package with custom Prometheus registry
- Configurable metrics port via --metrics-port flag (default: 8084)
- Two-server architecture with proper WaitGroup coordination
- Graceful shutdown for both main and metrics servers
- Export kagent_tools_mcp_server_info (version metadata)
- Export kagent_tools_mcp_registered_tools (tool providers)
- Include Go runtime metrics (goroutines, memory, GC stats)
- Include process metrics (CPU, memory, file descriptors)

Architecture improvement: Move http.Server instantiation outside goroutines to prevent race condition between assignment and shutdown.

Test coverage: 5 unit tests validating registry, collectors, and metrics.

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: MatteoMori <morimatteo14@gmail.com>

Use MCPServer.ListTools() to automatically detect which tools each provider registers, eliminating the need to modify individual tool packages.

The approach snapshots the tool list before and after each provider's RegisterTools() call, then records the newly added tools in Prometheus with the correct tool_provider label.

This means:
- Zero changes required in any pkg/ file
- Future tools are automatically tracked
- No risk of forgetting to add a metric for a new tool

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: MatteoMori <morimatteo14@gmail.com>
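The before/after snapshot idea from this commit might look roughly like the sketch below. The registry type, the trackProvider helper, and the tool and provider names are hypothetical stand-ins for the MCP server and its providers; the real code would call MCPServer.ListTools() and set a Prometheus gauge instead of filling a map.

```go
package main

import (
	"fmt"
	"sort"
)

// newlyRegistered returns the names present in after but not in before, sorted.
func newlyRegistered(before, after []string) []string {
	seen := make(map[string]struct{}, len(before))
	for _, name := range before {
		seen[name] = struct{}{}
	}
	var added []string
	for _, name := range after {
		if _, ok := seen[name]; !ok {
			added = append(added, name)
		}
	}
	sort.Strings(added)
	return added
}

// registry is a hypothetical stand-in for the MCP server's tool list.
type registry struct{ tools []string }

// listTools returns a copy so later registrations do not alter a snapshot.
func (r *registry) listTools() []string { return append([]string(nil), r.tools...) }
func (r *registry) addTool(name string) { r.tools = append(r.tools, name) }

// trackProvider snapshots the tool list, runs the provider's registration,
// and attributes every newly added tool to that provider.
func trackProvider(r *registry, provider string, register func(*registry), byProvider map[string][]string) {
	before := r.listTools()
	register(r)
	byProvider[provider] = newlyRegistered(before, r.listTools())
}

func main() {
	r := &registry{}
	byProvider := map[string][]string{}
	trackProvider(r, "k8s", func(r *registry) { r.addTool("k8s_get"); r.addTool("k8s_apply") }, byProvider)
	trackProvider(r, "helm", func(r *registry) { r.addTool("helm_list") }, byProvider)
	fmt.Println(byProvider["k8s"], byProvider["helm"]) // [k8s_apply k8s_get] [helm_list]
}
```

Because the diff is computed around each registration call, a provider that adds a tool later is picked up without touching its package.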

Add kagent_tools_mcp_invocations_total and kagent_tools_mcp_invocations_failure_total counters using the wrapper/middleware pattern. All handlers are centrally instrumented in wrapToolHandlersWithMetrics with zero changes to pkg/ files.

Update README with Observability section and CLI flags reference.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: MatteoMori <morimatteo14@gmail.com>

Add comprehensive Prometheus Operator integration via Helm chart:
- ServiceMonitor resource for automatic target discovery
- Dedicated metrics service (kagent-tools-metrics)
- Deployment args for --metrics-port configuration
- Configurable scrape interval, timeout, and labels

Include Grafana dashboard with 8 panels visualizing:
- Server version and health metrics
- Tool invocation rates by provider
- Success/failure rates and trends
- Top invoked tools table with heat mapping

Add CLAUDE.md with architecture documentation covering:
- Tool provider pattern and MCP server lifecycle
- Observability architecture (metrics wrapper pattern)
- Development commands and key implementation patterns
- Helm chart structure and troubleshooting guide

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: MatteoMori <morimatteo14@gmail.com>

Previously --metrics-port defaulted to 8084, causing a mismatch when the server ran on any other port (e.g. E2E tests use port 18190). The metrics server would start on 8084 instead of sharing the main port, so /metrics was unreachable at the expected address.

Change the default to 0, resolved at runtime as "same as --port". Update Helm templates to fall back to the main targetPort when tools.metrics.port is unset.

Signed-off-by: MatteoMori <morimatteo14@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
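The "0 means same as --port" rule can be expressed in a few lines. The sketch below is an assumption about how such resolution could be written; the function names resolveMetricsPort and needsDedicatedServer are hypothetical and do not come from the PR.

```go
package main

import "fmt"

// resolveMetricsPort returns the port the /metrics endpoint should use.
// A metricsPort of 0 means "share the main server port".
func resolveMetricsPort(mainPort, metricsPort int) int {
	if metricsPort == 0 {
		return mainPort
	}
	return metricsPort
}

// needsDedicatedServer reports whether a second HTTP server must be started.
func needsDedicatedServer(mainPort, metricsPort int) bool {
	return resolveMetricsPort(mainPort, metricsPort) != mainPort
}

func main() {
	// The E2E case from the commit message: main port 18190, flag left at 0.
	fmt.Println(resolveMetricsPort(18190, 0), needsDedicatedServer(18190, 0)) // 18190 false
	// An explicit, different metrics port requires its own server.
	fmt.Println(resolveMetricsPort(18190, 9090), needsDedicatedServer(18190, 9090)) // 9090 true
}
```

With this rule, a deployment that only sets --port still exposes /metrics at the address operators expect.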

dimetron

@MatteoMori sorry this was in pending status :)

dimetron · 2026-02-17T10:41:00Z

cmd/main.go

+
+				result, err := originalHandler(ctx, req)
+
+				if err != nil {


The failure counter is incremented only when the handler returns a non-nil err, but this codebase's handlers commonly return mcp.NewToolResultError(...), nil for tool-level failures. That means many failed invocations will never increment kagent_tools_mcp_invocations_failure_total, making the metric misleading.

MatteoMori · 2026-02-18

Good point! Thanks for the feedback. I am happy to have a look at this.

dimetron · 2026-02-17T10:44:41Z

CLAUDE.md

Let's exclude CLAUDE.md from the repo and add it to .gitignore.

I like this PR, @EItanya please add your comments

EItanya · 2026-02-18

I actually like the idea of having agents.md files inside of the repo so that all coding agents operating on the repo use similar instructions, but I think we should think through it before we add them.

MatteoMori and others added 4 commits February 12, 2026 19:59

MatteoMori requested review from EItanya and dimetron as code owners February 15, 2026 19:23

MatteoMori force-pushed the observability/prometheus branch from 02aaa2c to 569d744 February 15, 2026 19:26

dimetron requested changes Feb 18, 2026

MatteoMori wants to merge 5 commits into kagent-dev:main from MatteoMori:observability/prometheus


4 participants
