The Observability Stack
Good observability answers three questions instantly:
1. Is it broken? (Alerting)
2. Where is it broken? (Metrics + Tracing)
3. Why did it break? (Logs)
Prometheus Setup
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'nodejs-app'
static_configs:
- targets: ['app:3000']
metrics_path: '/metrics'
Key Alerts I Use
groups:
- name: production
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "Error rate > 5% for 2 minutes"
Grafana Dashboard Tips
Result
End-to-end visibility across 15+ services. MTTR dropped from 45 minutes to under 10 minutes.