Thursday, January 23, 2020

List stuck pods in OpenShift or Kubernetes

As an OpenShift or Kubernetes cluster administrator, you probably know how to list all pods in your cluster using

oc get pods --all-namespaces

or

kubectl get pods --all-namespaces

This command returns a list of all pods that currently exist in your cluster. The list can be quite long and, when troubleshooting, far too verbose, as most pods are in the Running state and do not have any issues.

By using grep you can easily filter out pods with certain statuses and suffixes:

oc get pods --all-namespaces | grep -v -e "Running" -e "Completed" -e "-build" -e "-deploy"

In this example, you specify several patterns using the -e flag. The -v flag then inverts the matching and returns all pods that potentially have a problem. Defining an alias, e.g. list-stuck-pods, gives you a handy command for troubleshooting your pods.
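
A minimal sketch of how such an alias could look (put it into your shell profile, e.g. ~/.bashrc, to have it available in every session):

alias list-stuck-pods='oc get pods --all-namespaces | grep -v -e "Running" -e "Completed" -e "-build" -e "-deploy"'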

Friday, December 13, 2019

Highly Customized Metrics For Your Prometheus

Prometheus is a great way to monitor your OpenShift and Kubernetes clusters. With its different exporters you are able to collect all kinds of basic metrics. Nevertheless, as soon as you need highly customized metrics, you have to provide them yourself, which can be quite elaborate.

That is where checkbot comes to the rescue. Checkbot is an open-source tool written in Go that runs custom shell scripts in a container on OpenShift or Kubernetes. The scripts can check functionality, compliance settings or any other customization in your cluster and expose the results as Prometheus metrics.

Yet Another Tool?

If you monitor OpenShift or Kubernetes clusters with Prometheus, you are probably using node-exporter, kube-state-metrics and blackbox-exporter. These exporters give you detailed insights into what is happening on your cluster. Also check this list for an overview of existing exporters that can provide metrics from third-party systems for Prometheus. And there are various client libraries that help you expose internal metrics of your applications as well.

But as always there is still that other nifty detail you would like to monitor. Some additional metric the compliance team would like to see. Or an aggregation over multiple components that is not achievable with the existing exporters.

Create Your Custom Metrics

To overcome these restrictions, checkbot provides a simple but very powerful way to generate customized metrics. Based on simple shell scripts, checkbot exposes metrics that can be scraped by Prometheus.



Let's say you have the requirement that all projects in your OpenShift cluster need to have a quota (CPU/memory) defined. If for whatever reason a project does not fulfill this rule, you want to get notified so you can take appropriate action.

The following script is a basic example of a check that tests whether each project in your cluster has a quota defined. First it gets all available projects and writes that (sorted) list to a file. As a second step it gets all existing quotas in the cluster and writes that information to a file as well. Afterwards a simple comparison yields a list of all projects without a quota. Finally the script loops through this list and prints all affected projects in a predefined format.

#!/bin/sh

# ACTIVE true
# TYPE Gauge
# HELP Check if all projects have quotas defined.
# INTERVAL 60

set -eux

# file1 contains all projects
oc get project --no-headers | awk '{print $1}' | sort > /tmp/file1

# file2 contains all quotas
oc get quota --all-namespaces --no-headers | awk '{print $1}' | sort | uniq > /tmp/file2

# result contains projects without quotas (lines present only in file1)
comm -23 /tmp/file1 /tmp/file2 > /tmp/result

# looping through results
while IFS="" read -r p || [ -n "$p" ]
do
  printf '1|project=%s\n' "$p"
done < /tmp/result

exit 0


There are some general conventions you must respect, but basically you are free to implement any check you can imagine. Some tools like curl, jq, oc and awscli are already available in the container, but feel free to add any additional ones. Based on the file name (missing_quota_on_project_total.sh) and a customizable prefix, checkbot defines the name of your metric, e.g. checkbot_missing_quota_on_project_total.

The execution of the checks and the rendering of the metrics are done by checkbot itself. Using some metadata comments you can control the following:
  • ACTIVE: Is the check currently active or not
  • TYPE: The type of the metric (Gauge, Summary, etc.)
  • HELP: Description of the metric
  • INTERVAL: Number of seconds between two runs of the check
The script must produce a result in a predefined format so that the checkbot is able to generate a valid metric:

value|label1=value1,label2=value2

You must provide two parts here. The first part is the value of your new checkbot_missing_quota_on_project_total metric. The second part is treated as labels: a comma-separated list of key/value pairs that define your metric dimensions. It is possible to have a multi-line result, but make sure that all entries use the same labels.

1|project=grafana
1|project=kube-dns
1|project=test

After each run, checkbot collects the result of the missing_quota_on_project_total.sh script and stores it as a Prometheus metric. You can check checkbot's /metrics endpoint to see the result of your script converted to a valid metric:

# HELP checkbot_missing_quota_on_project_total Check if all projects have quotas defined.
# TYPE checkbot_missing_quota_on_project_total gauge
checkbot_missing_quota_on_project_total{project="grafana"} 1
checkbot_missing_quota_on_project_total{project="kube-dns"} 1
checkbot_missing_quota_on_project_total{project="test"} 1


And Now

With your metrics ready, you can now configure Prometheus to scrape checkbot's /metrics endpoint, and your custom metrics will show up in Prometheus.
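
Before wiring up Prometheus, you can verify the endpoint manually. A quick sketch, assuming checkbot is reachable under the hypothetical service name checkbot on port 8080 (adjust both to your deployment):

# service name and port are assumptions, adjust to your checkbot deployment
curl -s http://checkbot:8080/metrics | grep checkbot_missing_quota_on_project_total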



Furthermore, you are now able to alert on this custom situation via Alertmanager or display the data nicely on some Grafana dashboards.
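
As a rough sketch, the same metric can be queried via the Prometheus HTTP API; the expression below (against a hypothetical Prometheus URL) flags every project that is still missing a quota and could also serve as the expression of an alerting rule:

# hypothetical Prometheus URL, adjust to your environment
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=checkbot_missing_quota_on_project_total > 0'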