Production Monitoring with Prometheus & Grafana — Complete Setup

Gulshan Kumar

5 December 2024

The Observability Stack

Good observability answers three questions instantly:

1. Is it broken? (Alerting)

2. Where is it broken? (Metrics + Tracing)

3. Why did it break? (Logs)

Prometheus Setup


# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'nodejs-app'
    static_configs:
      - targets: ['app:3000']
    metrics_path: '/metrics'
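
The scrape config alone does not load alert rules or deliver notifications. As a rough sketch, the additional top-level keys in prometheus.yml might look like the following. The file name `alerts.yml` and the `alertmanager:9093` address are assumptions for illustration, not taken from the original setup.

```yaml
# prometheus.yml (additional top-level keys, sketch only)

# Rule files are evaluated at the global evaluation interval.
rule_files:
  - 'alerts.yml'                        # hypothetical path to the alert rules file

# Where Prometheus sends alerts once they fire.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']  # assumed Alertmanager host and port
```

Running `promtool check config prometheus.yml` before reloading should catch syntax errors in both the main file and any rule files it references.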

Key Alerts I Use


groups:
  - name: production
    rules:
      - alert: HighErrorRate
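        # Ratio of 5xx responses to all responses over 5 minutes (0.05 = 5%).
        # A bare rate() would compare requests per second, not a percentage.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05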
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% for 2 minutes"
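
An error-rate alert cannot fire if the app stops answering scrapes altogether, so a target-down check is a common companion. The rule below is a sketch and not one of the author's listed alerts. It relies on the built-in `up` metric and the `nodejs-app` job name from the scrape config; the group name, `for` duration, and severity are assumptions.

```yaml
groups:
  - name: availability          # hypothetical group name
    rules:
      - alert: TargetDown
        # `up` is recorded by Prometheus for every target:
        # 1 if the last scrape succeeded, 0 if it failed.
        expr: up{job="nodejs-app"} == 0
        for: 1m                 # assumed duration; tune to tolerate restarts
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 1 minute"
```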

Grafana Dashboard Tips

Use a dashboard variable for the service name, so one dashboard covers all services

Add SLO panels (a 99.9% uptime target leaves a downtime budget of about 8.76 hours per year: 0.1% of 8,760 hours)

Route alerts to Slack or PagerDuty according to severity tier
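
Severity-based routing is typically configured in Alertmanager rather than in Prometheus itself. A minimal sketch, assuming that critical alerts page through PagerDuty and everything else goes to Slack: the receiver names, channel, webhook URL, and routing key are placeholders, and the split between tiers is an assumption rather than the author's actual policy.

```yaml
# alertmanager.yml (sketch only)
route:
  receiver: slack-default            # fallback for alerts that match no child route
  group_by: ['alertname']
  routes:
    - matchers:
        - severity="critical"        # matches the label set in the alert rule
      receiver: pagerduty-oncall

receivers:
  - name: slack-default              # hypothetical receiver name
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME'  # placeholder webhook
        channel: '#alerts'           # placeholder channel
  - name: pagerduty-oncall           # hypothetical receiver name
    pagerduty_configs:
      - routing_key: 'REPLACE_ME'    # placeholder Events API v2 key
```

Keeping the tier in a `severity` label means new alerts inherit the right destination without any change to the routing tree.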

Result

End-to-end visibility across 15+ services. MTTR dropped from 45 minutes to under 10 minutes.
