Prometheus is a great way to monitor your OpenShift and Kubernetes clusters. Its various exporters provide all kinds of basic metrics. However, as soon as you need highly customized metrics, you have to provide them yourself, which can be quite elaborate.
That is where checkbot comes to the rescue. Checkbot is an open-source tool written in Go that runs custom bash scripts in a container on OpenShift or Kubernetes. The scripts check functionality, compliance settings or any other customization in your cluster and expose the results as Prometheus metrics.
Yet Another Tool?
If you monitor OpenShift or Kubernetes clusters with Prometheus, you are probably using node-exporter, kube-state-metrics and blackbox-exporter. Such exporters give you detailed insight into what is happening on your cluster. Also check this list for an overview of existing exporters that provide metrics from third-party systems to Prometheus. There are also various client libraries that help you expose internal metrics of your own application.
But as always, there is still that one nifty detail you would like to monitor: some additional metric that the compliance team would like to see, or an aggregation over multiple components that is not achievable with the existing exporters.
Create Your Custom Metrics
To overcome these restrictions, the checkbot provides a simple but very powerful way to generate customized metrics. Based on simple shell scripts, the checkbot provides metrics that can be scraped by Prometheus.
Let's say you require every project in your OpenShift cluster to have a quota (cpu/memory). If for whatever reason a project does not fulfill this rule, you want to be notified so that you can take appropriate action.
The following script is a basic example of a check that tests whether each project in your cluster has a well-defined quota. First it gets all available projects and writes that (sorted) list to a file. As a second step it gets all existing quotas in the cluster and writes that information to a file as well. Afterwards a simple comparison yields a list of all projects without a quota. In the end the script loops through this list and prints all affected projects in a predefined format.
#!/bin/sh
# ACTIVE true
# TYPE Gauge
# HELP Check if all projects have quotas defined.
# INTERVAL 60
set -eux
# file1 contains all projects
oc get project --no-headers | awk '{print $1}' | sort > /tmp/file1
# file2 contains all projects that have at least one quota
oc get quota --all-namespaces --no-headers | awk '{print $1}' | sort -u > /tmp/file2
# result contains projects without quotas: comm -23 keeps only the lines unique to file1
comm -23 /tmp/file1 /tmp/file2 > /tmp/result
# looping through results
while IFS="" read -r p || [ -n "$p" ]
do
    printf '1|project=%s\n' "$p"
done < /tmp/result
exit 0
There are some general conventions you must respect, but otherwise you are free to implement any check you can imagine. Some tools like curl, jq, oc and awscli are already available in the container, but feel free to add more. The checkbot derives the name of your metric from the file name (missing_quota_on_project_total.sh) and a customizable prefix, e.g. checkbot_missing_quota_on_project_total.
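The naming rule described above can be illustrated with a few lines of shell. This is only a sketch of the convention, not the checkbot's own code (which is written in Go); the helper name `metric_name` and the `/checks/` directory are hypothetical.

```shell
#!/bin/sh
# Sketch only: derive a metric name from a check script's file name.
# `metric_name` and the /checks/ path are hypothetical, for illustration.
metric_name() {
    prefix="$1"
    script="$2"
    # Drop the directory part and the .sh extension, then prepend the prefix.
    base=$(basename "$script" .sh)
    printf '%s_%s\n' "$prefix" "$base"
}

metric_name checkbot /checks/missing_quota_on_project_total.sh
# prints: checkbot_missing_quota_on_project_total
```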
The checkbot itself executes the checks and renders the metrics. Metadata comments in the script header let you control the following:
- ACTIVE: Whether the check is currently active
- TYPE: The type of the metric (Gauge, Summary, etc.)
- HELP: Description of the metric
- INTERVAL: Number of seconds between two runs of the check
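To make the metadata convention more concrete, here is a minimal sketch of how such header comments could be read from a script. It assumes each entry is a single `# KEY value` line, as in the example check; the helper `read_meta` is hypothetical and does not reflect how the checkbot parses its scripts internally.

```shell
#!/bin/sh
# Sketch only: read_meta KEY FILE prints the value of a "# KEY value" comment.
# `read_meta` is a hypothetical helper, not part of the checkbot.
read_meta() {
    key="$1"
    file="$2"
    # Strip the leading "# KEY " and print the rest; keep the first match only.
    sed -n "s/^#[[:space:]]*${key}[[:space:]][[:space:]]*//p" "$file" | head -n 1
}

# Demo with a throwaway copy of the example header.
check=$(mktemp)
cat > "$check" <<'EOF'
#!/bin/sh
# ACTIVE true
# TYPE Gauge
# HELP Check if all projects have quotas defined.
# INTERVAL 60
EOF

read_meta TYPE "$check"      # prints: Gauge
read_meta INTERVAL "$check"  # prints: 60
rm -f "$check"
```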
The script must produce a result in a predefined format so that the checkbot is able to generate a valid metric:
value|label1=value1,label2=value2
You must provide two parts here. The first part is the value of your new checkbot_missing_quota_on_project_total metric. The second part is treated as labels: a comma-separated list of key/value pairs that define your metric dimensions. A multi-line result is possible, but make sure that all entries use the same labels:
1|project=grafana
1|project=kube-dns
1|project=test
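Before deploying a check, it can help to verify locally that its output follows this format. The following is a rough sketch, assuming numeric values, exactly one `|` per line, and label values without commas or pipes; `validate_result` is a hypothetical helper and not part of the checkbot.

```shell
#!/bin/sh
# Sketch only: validate_result reads check output on stdin and exits non-zero
# if a line is malformed or its label keys differ from the first valid line.
# Assumes numeric values and label values without "," or "|".
validate_result() {
    awk -F'|' '
    {
        if (NF != 2 || $1 !~ /^-?[0-9]+(\.[0-9]+)?$/) {
            printf "line %d: expected value|labels\n", NR > "/dev/stderr"
            bad = 1
            next
        }
        # Reduce "k1=v1,k2=v2" to the key list "k1,k2".
        n = split($2, pairs, ",")
        keys = ""
        for (i = 1; i <= n; i++) {
            sub(/=.*/, "", pairs[i])
            keys = keys (i > 1 ? "," : "") pairs[i]
        }
        if (!seen) {
            first = keys
            seen = 1
        } else if (keys != first) {
            printf "line %d: labels %s differ from %s\n", NR, keys, first > "/dev/stderr"
            bad = 1
        }
    }
    END { exit bad + 0 }
    '
}

printf '1|project=grafana\n1|project=kube-dns\n1|project=test\n' |
    validate_result && echo "result format ok"
```

In practice you would pipe the output of your check script into `validate_result` instead of the `printf` sample.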
After each run, the checkbot collects the result of the missing_quota_on_project_total.sh script and stores it as a Prometheus metric. You can check the /metrics endpoint of the checkbot to see the result of your script converted to a valid metric:
# HELP checkbot_missing_quota_on_project_total Check if all projects have quotas defined.
# TYPE checkbot_missing_quota_on_project_total gauge
checkbot_missing_quota_on_project_total{project="grafana"} 1
checkbot_missing_quota_on_project_total{project="kube-dns"} 1
checkbot_missing_quota_on_project_total{project="test"} 1
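The mapping from script output to the metric lines above can be approximated in shell. This sketch only illustrates the conversion; the checkbot does this in Go, and the helper `render_metric` is hypothetical. It assumes label values contain no commas, quotes or backslashes, so no escaping is performed.

```shell
#!/bin/sh
# Sketch only: render_metric NAME TYPE HELP converts "value|k=v,..." lines on
# stdin into Prometheus text exposition format. Hypothetical helper; label
# values are not escaped.
render_metric() {
    awk -F'|' -v name="$1" -v type="$2" -v help="$3" '
    BEGIN {
        printf "# HELP %s %s\n", name, help
        printf "# TYPE %s %s\n", name, tolower(type)
    }
    NF == 2 {
        # Turn k1=v1,k2=v2 into k1="v1",k2="v2".
        n = split($2, pairs, ",")
        labels = ""
        for (i = 1; i <= n; i++) {
            eq = index(pairs[i], "=")
            if (eq == 0) continue
            key = substr(pairs[i], 1, eq - 1)
            val = substr(pairs[i], eq + 1)
            labels = labels (labels == "" ? "" : ",") key "=\"" val "\""
        }
        printf "%s{%s} %s\n", name, labels, $1
    }
    '
}

printf '1|project=grafana\n1|project=kube-dns\n1|project=test\n' |
    render_metric checkbot_missing_quota_on_project_total Gauge \
        'Check if all projects have quotas defined.'
```

Run with the sample input, this reproduces the five metric lines shown above.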
And now
With your metrics ready, you can configure Prometheus to scrape the checkbot's /metrics endpoint, and your custom metrics will show up in Prometheus.
Furthermore, you can now alert on this custom situation via Alertmanager or display the data on Grafana dashboards.
