Explanation of replica, count and worker scaling for Red Hat OpenShift Container Storage 4.x


title: “Explanation of replica, count and worker scaling for Red Hat OpenShift Container Storage 4.x”
date: 2021-01-20T17:23:57
slug: explanation-of-replica-count-and-worker-scaling-for-red-hat-openshift-container-storage-4-x


Environment

Red Hat OpenShift Container Storage 4.x

Issue

Explanation of replica, count and worker scaling for Red Hat OpenShift Container Storage 4.x

When creating the StorageCluster object named ocs-storagecluster to deploy OpenShift Container Storage (OCS), administrators can set spec.storageDeviceSets[0].count and spec.storageDeviceSets[0].replica. What should these fields be set to?

For example:

Raw

cat <<'EOF' | oc apply -f -
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  manageNodes: false
  resources:
    mds:
      limits:
        cpu: 3
        memory: 8Gi
      requests:
        cpu: 1
        memory: 8Gi
    noobaa-core:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 1
        memory: 8Gi
    noobaa-db:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 1
        memory: 8Gi
  monDataDirHostPath: /var/lib/rook
  storageDeviceSets:
  - count: 2
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2328Gi
        storageClassName: localblock
        volumeMode: Block
    name: ocs-deviceset
    placement: {}
    portable: false
    replica: 3
    resources: {}
EOF

Resolution

Rules for node/disk scaling

  • The only supported value for replica: is 3.

  • OCS worker nodes must be scaled out in multiples of 3. That is, 3, 6, 9, … OCS worker nodes are supported, but 4 or 5, for example, are not.

  • Each OCS worker node must have the same number of equally sized PVs available for OCS ([number of OCS PVs per worker]).

  • count: must be set to [number of OCS worker nodes] * [number of OCS PVs per worker] / 3 (the division by 3 follows from the replica size of 3); see the worked example after this list.
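
As a worked example (the node and PV counts here are assumptions for illustration, not values taken from the cluster above): with 6 OCS worker nodes and 2 equally sized PVs per worker, count: comes out to 4.

Raw

# hypothetical values: 6 OCS worker nodes, 2 OCS PVs per worker
OCS_WORKER_NODES=6
PVS_PER_WORKER=2
REPLICA=3   # the only supported value

# count: = [number of OCS worker nodes] * [number of OCS PVs per worker] / 3
echo $(( OCS_WORKER_NODES * PVS_PER_WORKER / REPLICA ))   # prints 4

# cross-check: total OSDs = count * replica = 4 * 3 = 12 = 6 nodes * 2 PVs each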

Root Cause

The reason for this lies in the following:

  • The pools’ replicated size is hard coded to 3 with OCS; this can be verified from the Ceph toolbox, as shown in the sketch after this list. If the rack count is not divisible by 3, then the count: parameter must be a multiple of 3. The total number of OSDs needs to be divisible by 3.

  • One could in theory change the rack count by changing the replica: parameter, but this negatively influences scale-outs, forcing administrators to add nodes in multiples of the rack count (= the replica: parameter) at every single scale-out.

  • One can only add node capacity in multiples of rack count. Rack count is set by the replica: parameter. The ideal rack count is 3, as this is the smallest supported unit one can have with Ceph due to the pools’ replicated size of 3.
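
The hard-coded replicated size can be checked directly in Ceph. This is a minimal sketch, assuming the rook-ceph-tools toolbox deployment has been enabled in the openshift-storage namespace; each OCS pool is expected to report replicated size 3:

Raw

$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd pool ls detail | grep 'replicated size'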

When looking at the ocs-deviceset-x-y naming, replica: controls the maximum value of x, and count: controls the maximum value of y.

Each storageDeviceSet is tied to a specific rack according to replica:, as the OSD prepare jobs show:

Raw

$ oc get jobs -o name | while read job; do echo === $job === ; oc get $job -o yaml | egrep 'rack[0-9]+' ; done
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-0-0-nd5jp ===
 - rack0
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-0-1-ns9jx ===
 - rack0
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-1-0-qbf59 ===
 - rack1
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-1-1-j7bs8 ===
 - rack1
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-2-0-jwshs ===
 - rack2
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-2-1-49ldb ===
 - rack2

The nodes are distributed evenly across the racks:

Raw

$ oc get nodes -l topology.rook.io/rack=rack0
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-198-152.eu-west-1.compute.internal   Ready    worker   8d    v1.17.1+b83bc57
$ oc get nodes -l topology.rook.io/rack=rack1
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-197-77.eu-west-1.compute.internal    Ready    worker   8d    v1.17.1+b83bc57
$ oc get nodes -l topology.rook.io/rack=rack2
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-202-249.eu-west-1.compute.internal   Ready    worker   8d    v1.17.1+b83bc57

When new nodes are labeled for OCS, they are added to the existing racks. For example, when labeling a node with:

Raw

oc label node/ip-10-0-210-157.eu-west-1.compute.internal cluster.ocs.openshift.io/openshift-storage=''

The node will be added to rack0:

Raw

$ oc get nodes -l topology.rook.io/rack=rack0
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-198-152.eu-west-1.compute.internal   Ready    worker   8d    v1.17.1+b83bc57
ip-10-0-210-157.eu-west-1.compute.internal   Ready    worker   8d    v1.17.1+b83bc57
$ oc get nodes -l topology.rook.io/rack=rack1
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-197-77.eu-west-1.compute.internal    Ready    worker   8d    v1.17.1+b83bc57
$ oc get nodes -l topology.rook.io/rack=rack2
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-202-249.eu-west-1.compute.internal   Ready    worker   8d    v1.17.1+b83bc57

In the above example, 2 more nodes must be added for a supported configuration, so that each rack contains the same number of nodes.

When increasing count: by 1, one new OSD will be created in each rack, adding a total of 3 OSDs.
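
For example, to grow from count: 2 to count: 3 on the StorageCluster shown above, the field can be patched in place. This is a minimal sketch, assuming the object name and namespace from the example; with replica: 3 it results in 3 new OSDs, which require 3 additional unused PVs, one per rack:

Raw

$ oc -n openshift-storage patch storagecluster ocs-storagecluster --type json \
    -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 3}]'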

If the OCS worker node count is not divisible by 3, then OCS cannot create all OSDs for the scale-out. The OSD prepare job’s pod will remain Pending with a message similar to the following:

Raw

$ oc get pods | grep rook-ceph-osd-prepare-ocs-deviceset-2-2-vp5m4-2gmkv
rook-ceph-osd-prepare-ocs-deviceset-2-2-vp5m4-2gmkv 0/1 Pending 0 32m
$ oc describe pod rook-ceph-osd-prepare-ocs-deviceset-2-2-vp5m4-2gmkv | tail -1
 Warning FailedScheduling 22s (x29 over 32m) default-scheduler 0/10 nodes are available: 1 node(s) didn't find available persistent volumes to bind, 9 node(s) didn't match node selector.
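
To see how the labeled OCS worker nodes are currently spread across the racks (and to spot a rack that is short of nodes), the rack label can be printed as an extra column. This is a sketch using the node labels shown in the outputs above:

Raw

$ oc get nodes -l cluster.ocs.openshift.io/openshift-storage= -L topology.rook.io/rack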