OpenShift Monitoring Guide

OpenShift monitoring is an ever-evolving problem space, with many layers, approaches, and complexities. We attempt to unpack them here.

Overview

Note
Before reading this guide, we recommend first reading An Overview of OpenShift for System Admins.

This document provides starting guidance on how to build a monitoring approach for OpenShift. It presents a suggested categorization of checks that form the basis for conditions that should usually generate an alert of some kind in a production cluster.

This document does not propose any specific tooling for a monitoring approach; instead, it presents the data points we believe are important when designing a cluster monitoring approach and gives examples of the logic involved in creating alerts.

Note
Tooling and building a monitoring stack implementation will be discussed in a future guide.

Each section below presents a table of checks, giving a description of each data point and a sample command whose output might trigger an alert.
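
For example, any of the sample commands below can be wrapped in a small script that runs on a schedule and sends a notification when the check fails. The following is a minimal sketch, assuming a cron-driven setup; the check shown, the recipient address, and the mail-based delivery are placeholders for whatever mechanism your environment uses.

#!/bin/bash
# Minimal sketch: wrap one check in a script that notifies on failure.
# The check, the recipient address, and the mail-based delivery are
# all placeholders for your own alerting mechanism.

CHECK_NAME="Docker Daemon"
if ! systemctl is-active --quiet docker; then
    echo "${CHECK_NAME} check failed on $(hostname)" \
        | mail -s "ALERT: ${CHECK_NAME}" ops@example.com
fi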

Ensuring a cluster is healthy

Docker

Docker is an essential component of an OpenShift environment. The health of Docker on each master and node instance is critical to the stability of an OpenShift cluster. The following areas should be monitored on each node.

Table 1. Docker checks
Columns: Check Name, Description, Sample Alerting Logic

Docker Daemon

Check that docker is running on a system

systemctl is-active docker

Docker Storage

Check that docker’s storage has adequate space

echo $(echo \"$(docker info 2>/dev/null | awk '/Data Space Available/ {print $4}') / $(docker info 2>/dev/null | awk '/Data Space Total/ {print $4}')\" | bc -l) '>' 0.3 | bc -l

Docker Metadata Storage

Check that docker’s metadata storage volume is not full

echo $(echo \"$(docker info 2>/dev/null | awk '/Metadata Space Available/ {print $4}') / $(docker info 2>/dev/null | awk '/Metadata Space Total/ {print $4}')\" | bc -l) '>' 0.3 | bc -l

Nodes & Masters

Table 2. Node and master checks
Columns: Check Name, Description, Relevant Hosts, Sample Alerting Logic

Etcd Service

Check that etcd is active

Masters

systemctl is-active etcd

Etcd Storage

Check that the etcd volume is not too full

Masters

echo "$(lvs | awk '/etcd/ {print $5}') > 70" | bc

Master API Service (single master)

Check that the Master API Service is active

Masters

systemctl is-active atomic-openshift-master

Master API Service (multi-master)

Check that the Master API Service is active

Masters

systemctl is-active atomic-openshift-master-api

Master Controllers Service (multi-master)

Check that the Master Controllers Service is active

Masters

systemctl is-active atomic-openshift-master-controllers

Node Service

Check that the node service is active

All Nodes

systemctl is-active atomic-openshift-node

Node Storage

Check that the node’s local data storage volume is not too full

All Nodes

echo "$(lvs | awk '/origin/ {print $5}') > 70" | bc

OpenVSwitch Service

Check that the openvswitch service is active

All Nodes

systemctl is-active openvswitch
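
Taken together, the service checks in Table 2 lend themselves to a simple loop over systemctl. The sketch below is illustrative only; the service list is an assumption and must be trimmed to the host's role.

#!/bin/bash
# Sketch of the service checks from Table 2 for a single host.
# SERVICES is an assumption: trim it to this host's role (for example,
# drop etcd and the master services on a plain node, or replace the
# -api and -controllers units with atomic-openshift-master on a
# single-master cluster).

SERVICES="etcd atomic-openshift-master-api atomic-openshift-master-controllers atomic-openshift-node openvswitch"

for svc in $SERVICES; do
    if ! systemctl is-active --quiet "$svc"; then
        echo "ALERT: service $svc is not active on $(hostname)"
    fi
done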

API Endpoints

Many OpenShift components expose HTTP-based endpoints for interrogating their health and current operation. The following endpoints should be monitored.

Table 3. API Endpoint checks
Columns: Check Name, Description, Sample Alerting Logic

OpenShift Master API Server

Check the health of a master API Endpoint

curl -H "Authorization: Bearer $(oc whoami -t)" https://console.c1-ocp.myorg.com:8443/healthz | grep ok

Router

Check the health of the Router

curl -I http://router.default.svc.cluster.local:1936/healthz | grep 200

Registry

Check the health of the Registry

curl -I https://docker-registry.default.svc.cluster.local:5000/healthz | grep 200

Logging

Check the health of the EFK Logging Stack

Because of the various components and complexities involved, we recommend the OpenShift Logging health check script.

Metrics

Check the health of the Metrics Stack

Because of the various components and complexities involved, we recommend the OpenShift Metrics health check script.
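
The HTTP checks above can likewise be scripted. The sketch below reuses the example master hostname from Table 3, which would need to be replaced for your cluster; add -k to the curl calls if the endpoints use self-signed certificates.

#!/bin/bash
# Sketch of the endpoint checks from Table 3. The master hostname is
# the same example value used above; replace it for your cluster.
# Add -k to the curl calls for self-signed certificates.

MASTER_URL="https://console.c1-ocp.myorg.com:8443/healthz"
TOKEN="$(oc whoami -t)"

# The master API returns the body "ok" when healthy.
if ! curl -s -H "Authorization: Bearer $TOKEN" "$MASTER_URL" | grep -q ok; then
    echo "ALERT: master API health check failed"
fi

# The router and registry report health via the HTTP status line.
if ! curl -s -I http://router.default.svc.cluster.local:1936/healthz | grep -q 200; then
    echo "ALERT: router health check failed"
fi
if ! curl -s -I https://docker-registry.default.svc.cluster.local:5000/healthz | grep -q 200; then
    echo "ALERT: registry health check failed"
fi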

Ensuring a cluster has adequate capacity

The OpenShift Blog has published an excellent blog series that addresses the issue of cluster capacity management.

The first post in this series addresses the inner workings of Quotas, Requests, and Limits, and how they work together to provide information on cluster capacity.

The second post dives into how OpenShift deals with resource overcommitment. It includes guidance on how to properly protect nodes from issues related to resource strain.

The third post covers estimating workloads, measuring how accurate your resource estimates are, and properly sizing your cluster based on both.
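
Alongside the series, a quick spot check from the command line can show how close a node is to its request capacity. The sketch below relies on the "Allocated resources" section that oc describe node prints for each node; the node name is a hypothetical example.

# One way to spot-check a node: the "Allocated resources" section of
# `oc describe node` compares scheduled CPU/memory requests and limits
# against the node's capacity. The hostname is a hypothetical example.
oc describe node node1.c1-ocp.myorg.com | grep -A 5 "Allocated resources"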

What’s Next?