APM Implementation Checklist
A comprehensive checklist for implementing application performance monitoring from scratch, covering prerequisites through production rollout.
Last updated: March 2026
Prerequisites
Define service-level objectives (SLOs)
Establish target latency (e.g., P99 < 500ms) and availability (e.g., 99.9%) for each service before instrumenting. SLOs determine which metrics and alerts matter.
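As a sanity check before committing to a target, the percentile math is easy to run over a sample of real measurements. A minimal sketch using the stdlib statistics module (the latency numbers here are made up for illustration):

```python
from statistics import quantiles

# Hypothetical latency samples in milliseconds for one service
# (a deterministic stand-in for real measurements).
latencies_ms = [50] * 900 + [200] * 90 + [800] * 10  # 1000 requests

# quantiles with n=100 returns 99 cut points; index 98 is P99.
p99 = quantiles(latencies_ms, n=100)[98]

SLO_P99_MS = 500
print(f"P99 = {p99:.0f} ms, SLO met: {p99 <= SLO_P99_MS}")  # P99 = 794 ms, SLO met: False
```

Note how a small fraction of very slow requests dominates P99 even when 90% of traffic is fast -- this is why the SLO should be set on tail percentiles, not averages.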
Inventory all services and dependencies
Map every service, database, cache, and external API in your architecture. Include async workers and scheduled jobs -- these are frequently missed during instrumentation.
Choose instrumentation approach (auto vs manual)
Auto-instrumentation covers HTTP, gRPC, and database calls with zero code changes. Manual instrumentation adds custom spans for business logic. Most teams need both.
Set up OpenTelemetry Collector or backend
Deploy an OTel Collector as a sidecar or gateway to receive, process, and export telemetry. This decouples your application from the backend vendor.
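A minimal gateway-mode Collector configuration might look like the following sketch (the backend endpoint is a placeholder; processors and exporters will vary by vendor):

```yaml
# Sketch of a gateway Collector config -- adjust exporters for your backend.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp:
    endpoint: backend.example.com:4317  # placeholder backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Because applications only ever talk to the Collector, swapping the backend later is a change to this file, not to application code.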
Establish baseline performance metrics
Record current P50/P95/P99 latency, error rates, and throughput for each service before adding instrumentation. This lets you measure tracing overhead accurately.
SDK Installation
Install OpenTelemetry SDK for primary language
Add the OTel SDK and relevant auto-instrumentation packages. For Go: go.opentelemetry.io/otel. For Node.js: @opentelemetry/sdk-node. For Python: opentelemetry-sdk.
Configure exporter endpoint
Point the OTLP exporter at your Collector or backend. Set the OTEL_EXPORTER_OTLP_ENDPOINT environment variable or configure it programmatically. Prefer OTLP over gRPC (default port 4317) for lower overhead than OTLP over HTTP (port 4318).
Set service.name resource attribute
Every service must set a unique service.name via OTEL_SERVICE_NAME or in code. This is the primary grouping key in every observability backend -- get it right early.
Verify first span reaches backend
Send a test request and confirm the trace appears in your backend within 30 seconds. If missing, check collector logs for export errors or dropped spans.
Add SDK dependency to CI/CD pipeline
Pin the OTel SDK version in your dependency file and ensure CI builds pass. The core SDKs are stable and follow semver, but many contrib instrumentation packages are still pre-1.0 -- pin exact versions to avoid breaking changes.
Instrumentation
Instrument HTTP/gRPC entry points
Add middleware or interceptors for all inbound requests. This creates root spans with HTTP method, route, status code, and duration automatically.
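The shape of what such middleware records can be sketched without any framework (SPANS, trace_request, and the route string below are illustrative, not a real framework API; attribute names follow OTel semantic conventions):

```python
import time

SPANS = []  # stand-in for a span exporter

def trace_request(method: str, route: str, handler):
    """Middleware sketch: wrap an inbound handler in a root-span-like record."""
    start = time.perf_counter()
    status = 500  # assume server error unless the handler returns normally
    try:
        status = handler()
        return status
    finally:
        SPANS.append({
            "http.request.method": method,        # semantic-convention keys
            "http.route": route,
            "http.response.status_code": status,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

trace_request("GET", "/orders/{id}", lambda: 200)
print(SPANS[0]["http.route"])
```

Recording the route template (/orders/{id}) rather than the concrete URL keeps span cardinality bounded.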
Add database query tracing
Instrument database drivers to capture query spans with db.system, db.statement (sanitized), and duration. This identifies slow queries in trace waterfalls.
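Sanitizing db.statement matters because raw queries can leak user data into trace storage. A rough stdlib sketch of literal stripping (real instrumentation libraries ship far more robust sanitizers):

```python
import re

def sanitize_sql(statement: str) -> str:
    """Replace literal values with ? so db.statement carries no user data."""
    statement = re.sub(r"'(?:[^']|'')*'", "?", statement)   # string literals
    statement = re.sub(r"\b\d+(\.\d+)?\b", "?", statement)  # numeric literals
    return statement

print(sanitize_sql("SELECT * FROM users WHERE email = 'a@b.com' AND age > 30"))
# SELECT * FROM users WHERE email = ? AND age > ?
```

Sanitized statements still group identical query shapes together, which is exactly what you want for spotting slow queries.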
Trace external API calls
Wrap HTTP clients to create spans for outbound calls. Include the target URL, response status, and retry count. Context propagation headers (W3C traceparent) are injected automatically.
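Although instrumented clients inject it for you, it helps to recognize the W3C traceparent header when debugging propagation. A sketch of its four-part format (version-traceid-spanid-flags):

```python
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00" # low bit = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

header = make_traceparent()
print(header)
```

If an outbound request is missing this header, the downstream service will start a brand-new trace and the call chain breaks in two.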
Add custom spans for business logic
Create spans around critical business operations (payment processing, order fulfillment, ML inference). Use semantic naming: verb.noun format like process.payment.
Propagate context across async boundaries
Pass trace context explicitly through goroutines, thread pools, and message queues. Without this, async work creates orphan spans that break end-to-end traces.
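The principle can be shown with stdlib contextvars, which the Python OTel SDK itself builds on: a worker thread only sees the caller's context when you hand it over explicitly (current_trace_id here is a stand-in for the SDK's real context, not an OTel API):

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the active trace context.
current_trace_id = contextvars.ContextVar("trace_id", default=None)

def do_async_work():
    # Returns None unless the caller's context was propagated.
    return current_trace_id.get()

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")

with ThreadPoolExecutor() as pool:
    # Copy the caller's context and run the task inside it.
    ctx = contextvars.copy_context()
    propagated = pool.submit(ctx.run, do_async_work).result()
    # Submitting without the context loses the trace id -> orphan span.
    orphaned = pool.submit(do_async_work).result()

print(propagated, orphaned)
```

For message queues the same idea applies, except the context travels as message headers (e.g. a traceparent field) rather than an in-process copy.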
Alerting and Dashboards
Create service health dashboard
Build a dashboard showing request rate, error rate, and latency percentiles (RED metrics) per service. Include a service map if your backend supports it.
Set latency P95/P99 alerts
Alert when P95 or P99 latency exceeds your SLO threshold for 5+ minutes. Avoid alerting on P50 -- it hides tail latency issues that affect your most important requests.
Set error rate alerts
Alert when error rate exceeds baseline by 2x or crosses an absolute threshold (e.g., 1%). Use a sliding window of 5-10 minutes to avoid noise from single request failures.
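A sketch of a count-based sliding window (ErrorRateAlert and its parameters are illustrative, not a real alerting API); note that a single failure sitting exactly at the threshold does not fire:

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the error rate over the last N requests exceeds a threshold."""
    def __init__(self, window_size: int = 1000, threshold: float = 0.01):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, is_error: bool) -> bool:
        """Record one request; return True if the alert should fire."""
        self.window.append(is_error)
        return sum(self.window) / len(self.window) > self.threshold

alert = ErrorRateAlert(window_size=100, threshold=0.01)

# 100 healthy requests: never fires.
healthy = [alert.record(False) for _ in range(100)]
# A sustained 5% error rate: fires once errors accumulate in the window.
degraded = [alert.record(i % 20 == 0) for i in range(100)]

print(any(healthy), any(degraded))
```

Production systems usually use a time-based window rather than a count-based one, but the noise-suppression behavior is the same.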
Configure on-call notification channel
Route critical alerts to PagerDuty, Opsgenie, or your on-call tool. Set severity levels: P1 pages immediately, P2 notifies within 15 minutes, P3 creates a ticket.
Document runbook links in every alert
Every alert must link to a runbook describing investigation steps, likely causes, and remediation actions. Alerts without runbooks slow down incident response.
Validation and Rollout
Verify end-to-end trace connectivity
Send a request through your full service chain and confirm every hop appears as a span in a single trace. Missing spans indicate broken context propagation.
Load test with tracing enabled
Run a load test matching production traffic patterns and measure CPU/memory overhead from tracing. Overhead should be under 3% -- if higher, enable sampling.
Enable in staging for 1 week
Run with full tracing in staging to catch issues before production. Verify trace data quality, check for missing spans, and confirm alert thresholds are reasonable.
Progressive rollout to production
Roll out tracing gradually: 10% of traffic first, monitor for 24 hours, then 50%, then 100%. Use feature flags or deployment percentage to control the rollout.
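One common way to control a percentage rollout without a feature-flag service is stable hash bucketing (in_rollout and the id format below are illustrative):

```python
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically bucket an id into [0, 100); stable across deploys."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent

ids = [f"user-{i}" for i in range(10_000)]
enabled = sum(in_rollout(u, 10) for u in ids)
print(f"{enabled / len(ids):.1%} of users in the 10% rollout")
```

Because the bucket is derived from the id, raising the percentage from 10 to 50 keeps everyone already enrolled and only adds users -- no one flaps in and out of the rollout between requests.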
Confirm sampling rates appropriate for traffic
High-traffic services (>1000 RPS) should use head-based sampling at 1-10%. Always sample errors and slow requests at 100%. Tail-based sampling is ideal but requires collector support.
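Head-based ratio sampling is typically keyed deterministically on the trace id, similar in spirit to OTel's TraceIdRatioBased sampler, so every service in the chain agrees on each trace. (Keeping 100% of errors and slow requests cannot happen at the head, since the outcome is not yet known -- that is exactly what tail-based sampling in the collector is for.) A sketch:

```python
import random

def head_sample(trace_id_hex: str, rate: float) -> bool:
    """Deterministic ratio sampling keyed on the trace id, so every service
    in the call chain makes the same keep/drop decision for a given trace."""
    bound = int(rate * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound  # compare low 64 bits

# Same trace id -> same decision at every hop.
tid = "4bf92f3577b34da6a3ce929d0e0e4736"

# Over many random trace ids the keep rate approaches the configured ratio.
random.seed(1)
kept = sum(head_sample(f"{random.getrandbits(128):032x}", 0.10)
           for _ in range(10_000))
print(head_sample(tid, 1.0), head_sample(tid, 0.0), kept)
```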
Related Resources
Go Distributed Tracing Guide
Implement distributed tracing in Go services with OpenTelemetry
Node.js Distributed Tracing Guide
Add tracing to Node.js applications with automatic instrumentation
Python Distributed Tracing Guide
Instrument Python services for end-to-end trace visibility
Production Monitoring Checklist
Ensure your production environment is fully observable
OpenTelemetry Migration Guide
Migrate from vendor SDKs to OpenTelemetry
Put this into practice
TraceKit gives you distributed tracing, error tracking, and production debugging in one platform. Start free.