Expose UFFD pager stats as Prometheus metrics #297

Draft
jarugupj wants to merge 4 commits into main from hypeship/uffd-pager-metrics

@jarugupj jarugupj commented Jul 1, 2026

Summary

  • Add an opt-in TCP `/metrics` Prometheus endpoint to `hypeman-uffd-pager`, wired through the existing `lib/otel` OTel→Prometheus bridge already used by `hypeman-api`.
  • Instruments are observable, backed by callbacks reading the pager's existing atomic counters via `SnapshotStats()` / `SnapshotTimingStats()`. No new state, no changes to counter increment sites.
  • The JSON `/stats` handler is refactored to share the same snapshot via a new `(*server).stats()` method, so the JSON and Prometheus outputs stay in sync.
  • Every observation carries a `version_key` attribute so multiple pager versions running side-by-side stay distinguishable in queries.
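
The shared-snapshot idea in the bullets above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the PR's code: the `pagerStats` type, the two counter fields, and the hand-written text output are assumptions made for the example, whereas the actual change goes through the `lib/otel` OTel→Prometheus bridge with observable instruments. The point it shows is that both handlers read through one `stats()` method, so they cannot drift apart.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// pagerStats is a hypothetical snapshot type; the real pager's fields
// are not shown in this PR description.
type pagerStats struct {
	CacheHits   uint64 `json:"cache_hits"`
	CacheMisses uint64 `json:"cache_misses"`
}

// server holds hypothetical atomic counters that the hot paths increment.
type server struct {
	cacheHits   atomic.Uint64
	cacheMisses atomic.Uint64
	versionKey  string
}

// stats is the single place where the counters are read, so every
// output format reports values taken from the same snapshot.
func (s *server) stats() pagerStats {
	return pagerStats{
		CacheHits:   s.cacheHits.Load(),
		CacheMisses: s.cacheMisses.Load(),
	}
}

// handleStats serves the snapshot as JSON.
func (s *server) handleStats(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(s.stats())
}

// handleMetrics serves the same snapshot as hand-rolled Prometheus text.
// This stands in for the OTel bridge, which is not reproduced here.
func (s *server) handleMetrics(w http.ResponseWriter, _ *http.Request) {
	st := s.stats()
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintf(w, "hypeman_uffd_cache_hits_total{version_key=%q} %d\n", s.versionKey, st.CacheHits)
	fmt.Fprintf(w, "hypeman_uffd_cache_misses_total{version_key=%q} %d\n", s.versionKey, st.CacheMisses)
}

// render calls a handler in-process and returns the response body.
func render(h http.HandlerFunc) string {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Body.String()
}

func main() {
	s := &server{versionKey: "v1"} // "v1" is a placeholder version key
	s.cacheHits.Add(3)
	s.cacheMisses.Add(1)
	fmt.Print(render(s.handleStats))
	fmt.Print(render(s.handleMetrics))
}
```

Because the counters are only loaded inside `stats()`, the increment sites stay untouched, which matches the "no new state" claim above.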

New flags on `hypeman-uffd-pager` (all opt-in, empty = disabled):

  • `--metrics-addr` — TCP address for Prometheus `/metrics`
  • `--otel-endpoint` — OTLP push endpoint (pull remains available independently)
  • `--otel-insecure` — OTLP transport option
  • `--otel-metric-export-interval` — OTLP push interval
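
As a rough sketch of the "empty = disabled" convention, the flags could be parsed along these lines. Only the flag names come from this PR; the value types (string, bool, duration), the help strings, the `metricsConfig` struct, and the example address are assumptions for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// metricsConfig is a hypothetical holder for the four new flags.
type metricsConfig struct {
	MetricsAddr        string
	OtelEndpoint       string
	OtelInsecure       bool
	OtelExportInterval time.Duration
}

// parseMetricsFlags parses the flags with empty or zero defaults, so
// nothing is enabled unless the operator opts in explicitly.
func parseMetricsFlags(args []string) (metricsConfig, error) {
	var c metricsConfig
	fs := flag.NewFlagSet("hypeman-uffd-pager", flag.ContinueOnError)
	fs.StringVar(&c.MetricsAddr, "metrics-addr", "", "TCP address for Prometheus /metrics (empty = disabled)")
	fs.StringVar(&c.OtelEndpoint, "otel-endpoint", "", "OTLP push endpoint (empty = disabled)")
	fs.BoolVar(&c.OtelInsecure, "otel-insecure", false, "OTLP transport option")
	fs.DurationVar(&c.OtelExportInterval, "otel-metric-export-interval", 0, "OTLP push interval")
	err := fs.Parse(args)
	return c, err
}

// pullEnabled reports whether the /metrics listener should be started.
func (c metricsConfig) pullEnabled() bool { return c.MetricsAddr != "" }

// pushEnabled reports whether OTLP push should be configured.
func (c metricsConfig) pushEnabled() bool { return c.OtelEndpoint != "" }

func main() {
	// The address below is a placeholder, not a value taken from the PR.
	c, err := parseMetricsFlags([]string{"--metrics-addr", "127.0.0.1:9100"})
	if err != nil {
		fmt.Println("flag error:", err)
		return
	}
	fmt.Println("pull enabled:", c.pullEnabled())
	fmt.Println("push enabled:", c.pushEnabled())
}
```

Keeping pull and push behind separate checks reflects the note above that the pull endpoint remains available independently of OTLP push.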

Metrics exposed

Counters: `hypeman_uffd_{cache_hits,cache_misses,faults,backing_bytes_read,copies,copy_errors}_total`, plus `_nanos_total` accumulators for cache lookup / add / fault / read-page / backing-read / copy latencies.

Gauges: `hypeman_uffd_{cache_bytes,cache_max_bytes,cache_items,cache_shards,active_sessions,active_faults,max_concurrent_faults,draining}`, plus high-water `_max_nanos` gauges for the same latencies.
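
One common way to consume a `_nanos_total` accumulator is to divide its increase by the increase of a matching event counter between two scrapes, which gives a mean latency for that window. The sketch below assumes such a pairing exists (for example a fault counter with a fault latency accumulator); the `sample` type and the numbers are illustrative and not taken from the pager.

```go
package main

import (
	"fmt"
	"time"
)

// sample is one scrape of a hypothetical counter/accumulator pair.
type sample struct {
	Count      uint64 // event counter, e.g. a *_total metric
	NanosTotal uint64 // matching *_nanos_total accumulator
}

// meanLatency returns the average latency per event between two scrapes.
// It reports false when there were no new events or a counter went
// backwards (for example after a process restart).
func meanLatency(prev, cur sample) (time.Duration, bool) {
	if cur.Count <= prev.Count || cur.NanosTotal < prev.NanosTotal {
		return 0, false
	}
	events := cur.Count - prev.Count
	nanos := cur.NanosTotal - prev.NanosTotal
	return time.Duration(nanos / events), true
}

func main() {
	prev := sample{Count: 100, NanosTotal: 5_000_000}
	cur := sample{Count: 150, NanosTotal: 7_500_000}
	if d, ok := meanLatency(prev, cur); ok {
		fmt.Println("mean latency over window:", d)
	}
}
```

The `_max_nanos` gauges complement this: a windowed mean hides outliers, while a high-water mark keeps the worst case visible.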

Test plan

  • `GOOS=linux go build ./cmd/uffd-pager/ ./lib/uffdpager/`
  • `GOOS=linux go vet ./cmd/uffd-pager/ ./lib/uffdpager/`
  • `GOOS=linux go test ./lib/uffdpager/...` — includes new `TestRegisterMetricsObservesStats` and `TestRegisterMetricsNilMeter`
  • Deploy to `dev-yul-hypeman-0`, curl `http://127.0.0.1:/metrics`, verify Prometheus text and non-zero values on a running pager
  • Confirm metrics land in SigNoz after the corresponding infra PR ships the otel-collector scrape config

jarugupj and others added 4 commits July 1, 2026 19:43
Wire the pager's existing atomic counters into hypeman's OTel Prometheus
bridge so otel-collector can scrape them alongside the other host
exporters. The JSON /stats endpoint over the control unix socket is
unchanged; a new opt-in TCP /metrics endpoint reads the same snapshot.

--metrics-addr enables the pull endpoint. --otel-endpoint enables OTLP
push. Both are opt-in so existing deployments keep working.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Pair with the new --metrics-addr flag: pass through an env var so
per-instance overrides written via EnvironmentFile can opt in without
editing the base unit. Empty default keeps the metrics server off for
hosts that haven't been configured.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Required by scripts/check-uffd-version.sh whenever runtime files under
lib/uffdpager change. The new metrics endpoint is additive and does not
alter the UFFD wire protocol, but the pre-commit contract asks for an
explicit version bump on any pager code change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>