I’ve been switching my company over to Prometheus, and I’ve come across a few things I’d like to discuss and get opinions on.
First, concrete advice:
Don’t just write an alert like
expr: sum(rate(bar[5m])) > 5
Instead, record the rate as its own metric with a recording rule, and then alert on that recorded metric:
expr: bar:rate > 5
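From my Google days, I know I should probably put the rate's time window in the recorded metric's name (something like `bar:rate5m`), so it's obvious what window the rate was computed over.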
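Here is a rough sketch of what that record-then-alert pattern could look like in a rules file. The group name and alert name are made up for illustration, and `bar:rate5m` is just one way of putting the window into the name:

```yaml
groups:
  - name: bar_rules              # hypothetical group name
    rules:
      # Record the rate once, with the window spelled out in the name.
      - record: bar:rate5m
        expr: sum(rate(bar[5m]))

      # Alert on the recorded series instead of recomputing the rate.
      - alert: BarRateHigh       # hypothetical alert name
        expr: bar:rate5m > 5
```

Doing it this way means the rate is computed in one place, and the recorded series is there to graph when the alert fires.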
1) How long should the rate window be? [5m]? [2m]? [3m]? [10m]?
* I’ve adopted 5m as the standard across my company, as a compromise between reacting quickly and not over-smoothing
2) How long should alert `for`s be?
3) Metric naming
* I’m using `A_Metric_Name`; not sure if this is right
4) Recorded rule naming
* I like `product:metric[:submetric]:unit`; e.g. `houseparty:websockets_open:byDeviceType:sum`
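To make questions 2 and 4 a bit more concrete, here is a sketch of a recording rule named with the scheme above, plus an alert with a `for` clause. The underlying metric `websockets_open`, the `device_type` label, the alert name, the threshold, and the `for` duration are all placeholders I picked for illustration, not recommendations:

```yaml
groups:
  - name: houseparty_websockets  # hypothetical group name
    rules:
      # Recorded rule named product:metric:submetric:unit.
      # Assumes a raw metric websockets_open with a device_type label.
      - record: houseparty:websockets_open:byDeviceType:sum
        expr: sum by (device_type) (websockets_open)

      - alert: WebsocketsOpenLow # hypothetical alert name
        expr: houseparty:websockets_open:byDeviceType:sum < 10  # placeholder threshold
        for: 10m                 # placeholder; condition must hold this long before firing
```

With `for` set, the alert sits in the pending state until the expression has been true continuously for that duration, so a longer `for` trades detection speed for fewer flapping alerts.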