OpenShift Monitoring Guide

OpenShift Monitoring is an ever-evolving problem space, with many layers, approaches, and complexities. We attempt to unpack them here.

Overview

Note
Before reading this guide, we recommend first reading An Overview of OpenShift for System Admins.

This document provides starting guidance on building a monitoring approach for OpenShift. It presents a suggested categorization of checks, each of which should usually generate an alert of some kind in a production cluster.

This document does not propose specific tooling for a monitoring approach; instead, it presents the data points we believe are important when designing cluster monitoring and provides examples of the logic involved in creating alerts.

Note
Tooling and building a monitoring stack implementation will be covered in a future guide.

Each section below presents a table of checks, giving a description of each data point and a sample command whose result might trigger an alert.
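
As a minimal illustration of how these sample commands are typically turned into alerts, the sketch below wraps two checks from later sections in a small shell script. The send_alert function is a hypothetical placeholder for whatever notification mechanism is in use; the checks signal a problem either through a non-zero exit code or by evaluating to 1.

    #!/bin/bash
    # Minimal sketch: wiring sample check commands into alerts.
    # send_alert is a hypothetical placeholder; replace it with a call to
    # your notification tooling (mail, webhook, monitoring agent, etc.).
    send_alert() {
        echo "ALERT [$(hostname)]: $*" >&2
    }

    # Service-style check: alert when the command exits non-zero.
    if ! systemctl is-active --quiet docker; then
        send_alert "docker daemon is not active"
    fi

    # Threshold-style check: alert when the expression evaluates to 1.
    used_pct=$(df -hP /var/lib/origin 2>/dev/null | awk 'NR==2 {gsub("%",""); print $5}')
    if [ -n "$used_pct" ] && [ "$(echo "$used_pct > 70" | bc)" -eq 1 ]; then
        send_alert "/var/lib/origin is ${used_pct}% full"
    fi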

Ensuring a cluster is healthy

Docker

Docker is an essential component of an OpenShift environment, and the health of docker on each master and node directly affects the stability of the cluster. The following areas should be monitored on each host.

Table 1. Docker checks
Each entry below gives the check name, a description, and the sample alerting logic per storage driver.

Docker Daemon
Check that docker is running on a system.
  devicemapper: systemctl is-active docker
  overlay2: systemctl is-active docker

Docker Storage
Check that docker’s storage has adequate space. The overlay2 check assumes the logical volume is named dockerlv and the volume group is dockervg.
  devicemapper: echo "$(echo "$(docker info 2>/dev/null | awk '/Data Space Available/ {print $4}') / $(docker info 2>/dev/null | awk '/Data Space Total/ {print $4}')" | bc -l) > 0.3" | bc -l
  overlay2: echo "$(df -h | awk '/dockervg-dockerlv/ {print $5}' | awk -F% '{print $1}') > 70" | bc

Docker Metadata Storage
Check that docker’s metadata storage volume is not full.
  devicemapper: echo "$(echo "$(docker info 2>/dev/null | awk '/Metadata Space Available/ {print $4}') / $(docker info 2>/dev/null | awk '/Metadata Space Total/ {print $4}')" | bc -l) > 0.3" | bc -l
  overlay2: N/A with overlay2
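
One caveat with the devicemapper commands above: awk '{print $4}' keeps only the numeric part of docker info’s human-readable sizes, so the ratio is only meaningful when the Available and Total values happen to be reported in the same unit. The sketch below is a more defensive variant under the assumption that docker info prints sizes as "<value> <unit>" with kB/MB/GB/TB suffixes; it normalizes both values to bytes and alerts when less than 30% of the data pool is free (the 30% figure mirrors the sample logic above).

    #!/bin/bash
    # Defensive variant of the devicemapper data pool check.
    # Assumes `docker info` reports sizes as "<value> <unit>" (e.g. "7.44 GB").
    to_bytes() {
        awk -v val="$1" -v unit="$2" 'BEGIN {
            f["B"]=1; f["kB"]=1000; f["MB"]=1000^2; f["GB"]=1000^3; f["TB"]=1000^4
            if (unit in f) print val * f[unit]; else print val
        }'
    }

    info=$(docker info 2>/dev/null)
    avail=$(echo "$info" | awk -F': ' '/Data Space Available/ {print $2}')
    total=$(echo "$info" | awk -F': ' '/Data Space Total/ {print $2}')

    # Word splitting intentionally separates the value from its unit.
    avail_bytes=$(to_bytes $avail)
    total_bytes=$(to_bytes $total)

    if [ -n "$avail_bytes" ] && [ -n "$total_bytes" ] && \
       [ "$(echo "$avail_bytes / $total_bytes < 0.3" | bc -l)" -eq 1 ]; then
        echo "ALERT: docker data pool has less than 30% free space"
    fi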

Nodes & Masters

Table 2. Node and master checks
Each entry below gives the check name, a description, the relevant hosts, and the sample alerting logic per OCP version.

Etcd Service
Check that etcd is active. Relevant hosts: Masters.
  OCP <= 3.9: systemctl is-active etcd
  OCP >= 3.10: oc get pods -n kube-system --no-headers -o=custom-columns=POD:.metadata.name,STATUS:.status.phase | grep -i "master-etcd" | grep -i "running" | if [ $(wc -l) -eq $(oc get pods -n kube-system --no-headers -o=custom-columns=POD:.metadata.name | grep etcd | wc -l) ]; then exit 0; else exit 1; fi

Etcd Storage
Check that the etcd volume is not too full. This check assumes the etcd storage (/var/lib/etcd) is provisioned on a separate logical volume. Relevant hosts: Masters.
  OCP <= 3.9 and >= 3.10: echo "$(lvs | awk '/etcd/ {print $4}') > 70" | bc or echo "$(df -h | awk '/etcd/ {print $5}' | awk -F% '{print $1}') > 70" | bc

Master API Service (single master)
Check that the Master API service or pods are active. Relevant hosts: Masters.
  OCP <= 3.9: systemctl is-active atomic-openshift-master
  OCP >= 3.10: same as the multi-master check

Master API Service (multi-master)
Check that the Master API service or pods are active. Relevant hosts: Masters.
  OCP <= 3.9: systemctl is-active atomic-openshift-master-api
  OCP >= 3.10: oc get pods -n kube-system --no-headers -o=custom-columns=POD:.metadata.name,STATUS:.status.phase | grep -i "master-api" | grep -i "running" | if [ $(wc -l) -eq $(oc get pods -n kube-system --no-headers -o=custom-columns=POD:.metadata.name | grep master-api | wc -l) ]; then exit 0; else exit 1; fi

Master Controllers Service (multi-master)
Check that the Master Controllers service or pods are active. Relevant hosts: Masters.
  OCP <= 3.9: systemctl is-active atomic-openshift-master-controllers
  OCP >= 3.10: oc get pods -n kube-system --no-headers -o=custom-columns=POD:.metadata.name,STATUS:.status.phase | grep -i "master-controller" | grep -i "running" | if [ $(wc -l) -eq $(oc get pods -n kube-system --no-headers -o=custom-columns=POD:.metadata.name | grep master-controller | wc -l) ]; then exit 0; else exit 1; fi

Node Service
Check that the node service is active. Relevant hosts: All nodes.
  OCP <= 3.9 and >= 3.10: systemctl is-active atomic-openshift-node

Node Storage
Check that the node’s local data storage volume is not too full. This check assumes the node storage (/var/lib/origin) is provisioned on a separate logical volume. Relevant hosts: All nodes.
  OCP <= 3.9 and >= 3.10: echo "$(lvs | awk '/origin/ {print $4}') > 70" | bc or echo "$(df -h | awk '/origin/ {print $5}' | awk -F% '{print $1}') > 70" | bc

OpenVSwitch Service
Check that the openvswitch service or pods are active. Relevant hosts: All nodes.
  OCP <= 3.9: systemctl is-active openvswitch
  OCP >= 3.10: oc get pods -n openshift-sdn --no-headers -o=custom-columns=POD:.metadata.name,STATUS:.status.phase | grep -i "ovs-" | grep -i "running" | if [ $(wc -l) -eq $(oc get nodes --no-headers | wc -l) ]; then exit 0; else exit 1; fi

SDN Service
Check that all the SDN pods are active. Relevant hosts: All nodes.
  OCP <= 3.9: N/A
  OCP >= 3.10: oc get pods -n openshift-sdn --no-headers -o=custom-columns=POD:.metadata.name,STATUS:.status.phase | grep -i "sdn-" | grep -i "running" | if [ $(wc -l) -eq $(oc get nodes --no-headers | wc -l) ]; then exit 0; else exit 1; fi
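
Several of the >= 3.10 checks above share the same shape: count the Running pods whose name matches a prefix in a given namespace and compare that count to an expected number. The helper sketched below keeps that logic in one place; the function name is an assumption of this example, and it relies on masters carrying the default node-role.kubernetes.io/master=true label in 3.10+.

    #!/bin/bash
    # check_pods_running NAMESPACE NAME_PREFIX EXPECTED_COUNT
    # Succeeds (exit 0) when the number of Running pods whose name matches
    # NAME_PREFIX in NAMESPACE equals EXPECTED_COUNT.
    check_pods_running() {
        local ns=$1 prefix=$2 expected=$3 running
        running=$(oc get pods -n "$ns" --no-headers \
            -o=custom-columns=POD:.metadata.name,STATUS:.status.phase \
            | awk -v p="$prefix" 'tolower($1) ~ p && tolower($2) == "running"' \
            | wc -l)
        [ "$running" -eq "$expected" ]
    }

    masters=$(oc get nodes -l node-role.kubernetes.io/master=true --no-headers | wc -l)
    nodes=$(oc get nodes --no-headers | wc -l)

    check_pods_running kube-system   master-api "$masters" || echo "ALERT: master-api pods not all running"
    check_pods_running openshift-sdn sdn-       "$nodes"   || echo "ALERT: sdn pods not all running"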

API Endpoints

Many OpenShift components expose HTTP-based endpoints for interrogating their health and current operation. The following endpoints should be monitored.

Table 3. API Endpoint checks
Each entry below gives the check name, a description, and the sample alerting logic.

OpenShift Master API Server
Check the health of a master API endpoint.
  curl -s https://console.c1-ocp.myorg.com:8443/healthz | grep ok

Router
Check the health of the Router.
  curl http://router.default.svc.cluster.local:1936/healthz | grep 200

Registry
Check the health of the Registry.
  curl -I https://docker-registry.default.svc.cluster.local:5000/healthz | grep 200

Logging
Check the health of the EFK Logging Stack.
  Because of the various components and complexities involved, we recommend the OpenShift Logging health check script.

Metrics
Check the health of the Metrics Stack.
  Because of the various components and complexities involved, we recommend the OpenShift Metrics health check script.
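
A simple way to sweep these endpoints is to loop over them and compare the returned HTTP status code directly instead of grepping output. The URLs below reuse the examples from Table 3 and would need to be adjusted for your cluster; -k is included only in case the monitoring host does not trust the cluster certificates.

    #!/bin/bash
    # Minimal sketch: probe each health endpoint and alert on anything but 200.
    endpoints=(
        "https://console.c1-ocp.myorg.com:8443/healthz"
        "http://router.default.svc.cluster.local:1936/healthz"
        "https://docker-registry.default.svc.cluster.local:5000/healthz"
    )

    for url in "${endpoints[@]}"; do
        code=$(curl -ks -o /dev/null -w '%{http_code}' --max-time 10 "$url")
        if [ "$code" != "200" ]; then
            echo "ALERT: $url returned HTTP $code"
        fi
    done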

Ensuring a cluster has adequate capacity

The OpenShift Blog has published an excellent blog series that addresses the issue of cluster capacity management.

The first post in this series addresses the inner workings of Quotas, Requests, and Limits, and how they work together to provide information on cluster capacity.

The second post dives into how OpenShift deals with resource overcommitment. It includes guidance on how to properly protect nodes from issues related to resource strain.

The third post gets into estimating workloads, measuring how accurate your resource estimates are, and properly sizing your cluster based on that information.
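
Those concepts can also be spot-checked directly from the command line. A few starting points are sketched below; <node-name> and <project> are placeholders, and oc adm top requires the metrics stack to be installed.

    # Requests, limits, and allocatable capacity on a node
    oc describe node <node-name> | grep -A 8 "Allocated resources"

    # Quotas and limit ranges defined in a project
    oc get resourcequota,limitrange -n <project>

    # Current usage, when metrics are available
    oc adm top node
    oc adm top pod -n <project>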

What’s Next?