title: "Replace OSD"
date: 2020-04-18T08:54:15
slug: replace-osd
Exec into the toolbox pod:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
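On recent kubectl versions the jsonpath lookup can be skipped by exec'ing into the Deployment directly (assuming the toolbox Deployment is named rook-ceph-tools):
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash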
Useful Commands:
ceph status
ceph osd tree
ceph osd ls (to list OSD IDs)
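A failed OSD shows up as down in the tree; on recent Ceph releases the tree can be filtered to only down OSDs:
ceph osd tree down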
Remove the faulty OSD (in this example the OSD with ID 1):
# ceph osd out osd.1
marked out osd.1.
# ceph osd crush remove osd.1
removed item id 1 name 'osd.1' from crush map
# ceph auth del osd.1
updated
# ceph osd rm osd.1
removed osd.1
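On Luminous and newer the CRUSH removal, auth deletion and OSD removal can also be collapsed into a single purge (shorter alternative for the same example ID):
# ceph osd out osd.1
# ceph osd purge osd.1 --yes-i-really-mean-it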
Remove the Deployment for the faulty OSD (it will be recreated in the next step):
# kubectl delete deployment -n rook-ceph rook-ceph-osd-1
deployment.extensions "rook-ceph-osd-1" deleted
Scale the Operator down and up again to detect new OSDs:
# kubectl scale deployment rook-ceph-operator --replicas=0 -n rook-ceph
deployment.extensions/rook-ceph-operator scaled
# kubectl get pods --all-namespaces -o wide|grep operator
# kubectl scale deployment rook-ceph-operator --replicas=1 -n rook-ceph
deployment.extensions/rook-ceph-operator scaled
# kubectl get pods --all-namespaces -o wide|grep operator
rook-ceph-system rook-ceph-operator-76cf7f88f-g9pxr 0/1 ContainerCreating 0 2s kube-ceph02
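Instead of watching the pod list, the operator rollout can also be waited on (same namespace as in the scale commands above):
kubectl -n rook-ceph rollout status deployment/rook-ceph-operator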
New OSD prepare pods should be created:
kubectl get pods -n rook-ceph -o wide
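The prepare pods can also be selected by label instead of scanning the whole list (assuming the usual Rook label app=rook-ceph-osd-prepare):
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare -o wide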
Check with:
ceph status
ceph osd tree
ceph osd ls
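To follow the recovery continuously instead of re-running ceph status:
ceph -w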
In case of problems:
On the OSD node:
view /var/lib/rook/rook-ceph/log/ceph-volume.log
Check the container logs of the OSD prepare pod:
kubectl logs -n rook-ceph rook-ceph-osd-prepare-k8s-node01-hl4rj
Check the MGR container logs for recovery progress:
kubectl logs -f -n rook-ceph rook-ceph-mgr-a-7cb4ccffc6-dnz2q
debug 2020-04-18 08:57:59.352 7ffb18b9b700 0 log_channel(cluster) log [DBG] : pgmap v940: 64 pgs: 2 active+undersized+degraded+remapped+backfilling, 55 active+clean, 7 active+undersized+degraded+remapped+backfill_wait; 14 GiB data, 39 GiB used, 708 GiB / 750 GiB avail; 799 KiB/s rd, 5.0 MiB/s wr, 231 op/s; 2648/33588 objects degraded (7.884%); 5.7 MiB/s, 4 objects/s recovering
192.168.1.7 - - [18/Apr/2020:08:58:01] "GET / HTTP/1.1" 200 155 "" "kube-probe/1.18"
debug 2020-04-18 08:58:01.352 7ffb18b9b700 0 log_channel(cluster) log [DBG] : pgmap v942: 64 pgs: 2 active+undersized+degraded+remapped+backfilling, 55 active+clean, 7 active+undersized+degraded+remapped+backfill_wait; 14 GiB data, 39 GiB used, 708 GiB / 750 GiB avail; 738 KiB/s rd, 6.0 MiB/s wr, 270 op/s; 2648/33588 objects degraded (7.884%); 7.1 MiB/s, 5 objects/s recovering
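Once all PGs are back to active+clean and ceph status reports HEALTH_OK the replacement is complete; anything still outstanding is listed with:
ceph health detail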
