title: "Explanation of replica, count and worker scaling for Red Hat OpenShift Container Storage 4.x"
date: 2021-01-20T17:23:57
slug: explanation-of-replica-count-and-worker-scaling-for-red-hat-openshift-container-storage-4-x
Environment
Red Hat OpenShift Container Storage 4.x
Issue
Explanation of replica, count and worker scaling for Red Hat OpenShift Container Storage 4.x
When creating the StorageCluster object named ocs-storagecluster to deploy OpenShift Container Storage (OCS), administrators can set spec.storageDeviceSets[0].count and spec.storageDeviceSets[0].replica. What should these fields be set to?
For example:
cat <<'EOF' | oc apply -f -
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  manageNodes: false
  resources:
    mds:
      limits:
        cpu: 3
        memory: 8Gi
      requests:
        cpu: 1
        memory: 8Gi
    noobaa-core:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 1
        memory: 8Gi
    noobaa-db:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 1
        memory: 8Gi
  monDataDirHostPath: /var/lib/rook
  storageDeviceSets:
  - count: 2
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2328Gi
        storageClassName: localblock
        volumeMode: Block
    name: ocs-deviceset
    placement: {}
    portable: false
    replica: 3
    resources: {}
EOF
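On an already deployed cluster, the currently configured values can be read back from the StorageCluster object, for example with the following jsonpath query (a sketch; adjust the namespace and object name if they differ):
$ oc -n openshift-storage get storagecluster ocs-storagecluster \
    -o jsonpath='replica: {.spec.storageDeviceSets[0].replica}, count: {.spec.storageDeviceSets[0].count}{"\n"}'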
Resolution
Rules for node/disk scaling
- The only supported value for replica: is 3.
- OCS worker nodes must be scaled out in multiples of 3, meaning that 3, 6, 9, … OCS worker nodes are supported, but for example 4 or 5 are not.
- Each OCS worker node must have the same number of equally sized PVs that can be used for OCS; this number is referred to as [number of OCS PVs per worker] below.
- count: must be set to [number of OCS worker nodes] * [number of OCS PVs per worker] / 3 (the division by 3 comes from the replica size of 3). See the worked example after this list.
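A minimal worked example of this calculation, assuming for illustration 6 OCS worker nodes with 2 equally sized OCS PVs each (these numbers are not taken from the manifest above):
$ WORKERS=6; PVS_PER_WORKER=2; REPLICA=3
$ echo $(( WORKERS * PVS_PER_WORKER / REPLICA ))
4
In that case count: would be set to 4, resulting in 12 OSDs spread over 3 racks.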
Root Cause
The reason for this lies in the following:
- The pools' replicated size is hard coded to 3 with OCS (this can be verified as shown after this list). If the rack count is not divisible by 3, then the count: parameter must be a multiple of 3; the total number of OSDs needs to be divisible by 3.
- One could in theory change the rack count by changing the replica: parameter, but this negatively influences scale-outs, forcing administrators to add nodes in multiples of the rack count (= the replica: parameter) at every single scale-out.
- One can only add node capacity in multiples of the rack count. The rack count is set by the replica: parameter. The ideal rack count is 3, as this is the smallest supported unit one can have with Ceph due to the pools' replicated size of 3.
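The replicated size of the pools can be checked from Ceph itself, for example through the rook-ceph toolbox (a sketch; it assumes the rook-ceph-tools deployment has been enabled in the openshift-storage namespace):
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd pool ls detail
Each pool listed should report "replicated size 3".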
When looking at the device set names of the form ocs-deviceset-x-y, replica: controls the maximum value of x and count: controls the maximum value of y.
Each storageDeviceSet replica is tied to a specific rack, as the OSD prepare jobs show:
$ oc get jobs -o name | while read job; do echo === $job === ; oc get $job -o yaml | egrep 'rack[0-9]+' ; done
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-0-0-nd5jp ===
- rack0
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-0-1-ns9jx ===
- rack0
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-1-0-qbf59 ===
- rack1
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-1-1-j7bs8 ===
- rack1
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-2-0-jwshs ===
- rack2
=== job.batch/rook-ceph-osd-prepare-ocs-deviceset-2-1-49ldb ===
- rack2
The OCS worker nodes are distributed evenly across these racks, one rack per node:
$ oc get nodes -l topology.rook.io/rack=rack0
NAME STATUS ROLES AGE VERSION
ip-10-0-198-152.eu-west-1.compute.internal Ready worker 8d v1.17.1+b83bc57
$ oc get nodes -l topology.rook.io/rack=rack1
NAME STATUS ROLES AGE VERSION
ip-10-0-197-77.eu-west-1.compute.internal Ready worker 8d v1.17.1+b83bc57
$ oc get nodes -l topology.rook.io/rack=rack2
NAME STATUS ROLES AGE VERSION
ip-10-0-202-249.eu-west-1.compute.internal Ready worker 8d v1.17.1+b83bc57
When labeling additional nodes for OCS, they will be added to these racks. For example, when labeling the following node:
oc label node/ip-10-0-210-157.eu-west-1.compute.internal cluster.ocs.openshift.io/openshift-storage=''
The node will be added to rack0:
$ oc get nodes -l topology.rook.io/rack=rack0
NAME STATUS ROLES AGE VERSION
ip-10-0-198-152.eu-west-1.compute.internal Ready worker 8d v1.17.1+b83bc57
ip-10-0-210-157.eu-west-1.compute.internal Ready worker 8d v1.17.1+b83bc57
$ oc get nodes -l topology.rook.io/rack=rack1
NAME STATUS ROLES AGE VERSION
ip-10-0-197-77.eu-west-1.compute.internal Ready worker 8d v1.17.1+b83bc57
$ oc get nodes -l topology.rook.io/rack=rack2
NAME STATUS ROLES AGE VERSION
ip-10-0-202-249.eu-west-1.compute.internal Ready worker 8d v1.17.1+b83bc57
In the above example, 2 more nodes must be added to reach a supported configuration, so that each rack contains the same number of nodes.
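The distribution of labeled nodes across racks can be checked in a single command, for example by printing the rack label as an extra output column (a sketch using the -L/--label-columns option of oc get):
$ oc get nodes -l cluster.ocs.openshift.io/openshift-storage -L topology.rook.io/rack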
When increasing count: by 1, one new OSD will be created in each rack, adding a total of 3 OSDs.
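A hedged sketch of such a scale-out, raising count: from 2 to 3 on the example cluster above (the value 3 is only an illustration and has to match the formula from the resolution section):
$ oc -n openshift-storage patch storagecluster ocs-storagecluster --type json \
    -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 3}]'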
If the OCS worker node count cannot be divided by 3, then OCS cannot create all OSDs for the scale-out. The OSD prepare job's pod will remain in the Pending state with a message similar to the following:
$ oc get pods | grep rook-ceph-osd-prepare-ocs-deviceset-2-2-vp5m4-2gmkv
rook-ceph-osd-prepare-ocs-deviceset-2-2-vp5m4-2gmkv 0/1 Pending 0 32m
$ oc describe pod rook-ceph-osd-prepare-ocs-deviceset-2-2-vp5m4-2gmkv | tail -1
Warning FailedScheduling 22s (x29 over 32m) default-scheduler 0/10 nodes are available: 1 node(s) didn't find available persistent volumes to bind, 9 node(s) didn't match node selector.
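In that situation it also helps to confirm that every new OCS worker provides an unbound PV from the local storage class used by OCS (localblock in the example above), for instance:
$ oc get pv | grep localblock
OSDs can only be created on nodes that still have an Available PV from that storage class.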
