This document describes the structure of our Ceph cluster as of May 12, 2021.

## Ceph

The Ceph cluster used for our OpenStack, OpenShift, and baremetal environments is made up of 10 OSD servers and 3 monitors. Each OSD server has 10 HDDs and 1 NVMe drive. The HDDs make up the "size" root, while the NVMe drives make up the "performance" root. Everything is triple replicated except for the RGW storage pools.

## How do I log in?

1. Log in to the monitors over the Foreman network (172.16.0.0/19). This works if you have access to the MOC VPN or a gateway like `kzn-ipmi-gw.infra.massopen.cloud`.

   | Host | IP Address |
   | ---- | ---------- |
   | kzn-mon1 | 172.16.19.15 |
   | kzn-mon2 | 172.16.17.14 |
   | kzn-mon3 | 172.16.5.14 |

   From the monitors you can run the various ceph commands. Check /etc/hosts on a monitor for the addresses you need to SSH to the OSD servers and RGWs.

2. Log in to the public RGWs that we use for OpenStack Swift.

   | Host | IP Address |
   | ---- | ---------- |
   | kzn-rgw1.infra.massopen.cloud | 128.31.24.16 |
   | kzn-rgw2.infra.massopen.cloud | 128.31.24.17 |
   | kzn-swift.infra.massopen.cloud | 128.31.24.18 |

   The swift endpoint will take you to one of the gateways. From there, check the hosts file and you can get to the monitors.

## Server Configuration

All our OSD servers are Dell PowerEdge R730xd. Here's the configuration of an OSD server.

| Field | Value |
| ------------------ | ------ |
| CPU(s) | 40 |
| Thread(s) per core | 2 |
| Core(s) per socket | 10 |
| Socket(s) | 2 |
| Model Name | Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz |
| RAM | 128 GB DDR3 |
| Model | PowerEdge R730xd |
| Networking | Dual 10GB NICs |
| Storage | 10 x 10TB 7.2k rpm drives for "size" root |
| Storage | 1 x 8TB PCIe NVMe drive for "performance" root |
| Storage | 2 x 128GB SSDs for OS |

We have 3 Dell PowerEdge R330 servers. Each one acts as a monitor (mon), manager (mgr), and metadata (mds) server. Here's the configuration of those servers:

| Field | Value |
| ------------------ | ------ |
| CPU(s) | 4 |
| Thread(s) per core | 1 |
| Core(s) per socket | 4 |
| Socket(s) | 1 |
| Model Name | Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz |
| RAM | 64 GB DDR3 |
| Model | PowerEdge R330 |
| Networking | Dual 10GB NICs |
| Storage | 2 x 128GB SSDs for OS |

The Rados Gateways are virtual machines on our RHEV/oVirt cluster. The test iSCSI server is a VM on the BMI Intel node.

Here's the output of `ceph status`, which shows the various ceph services.

```
[root@kzn-mon1 ~]# ceph -s
  cluster:
    id:     ed141a90-e501-481c-90c2-b9e24a7c9b54
    health: HEALTH_OK

  services:
    mon:         3 daemons, quorum kzn-mon1,kzn-mon2,kzn-mon3 (age 6d)
    mgr:         kzn-mon1(active, since 6d), standbys: kzn-mon2, kzn-mon3
    mds:         cephfs:1 {0=kzn-mon1=up:active} 2 up:standby
    osd:         130 osds: 129 up (since 5d), 129 in (since 8d)
    rgw:         4 daemons active (kzn-rgw-j-01, kzn-rgw-j-02, kzn-rgw1, kzn-rgw2)
    tcmu-runner: 1 daemon active (kzn-vbmi01stack.kzn.moc:m2/ceph-iscsi-1)

  task status:
    scrub status:
        mds.kzn-mon1: idle

  data:
    pools:   22 pools, 2000 pgs
    objects: 50.08M objects, 76 TiB
    usage:   223 TiB used, 910 TiB / 1.1 PiB avail
    pgs:     1995 active+clean
             5    active+clean+scrubbing+deep

  io:
    client:   43 MiB/s rd, 417 MiB/s wr, 781 op/s rd, 3.12k op/s wr
```
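If you want to consume this status from a script (for example a monitoring check) rather than reading it by eye, the same command can emit JSON. Here is a minimal sketch, assuming it runs on a host with admin credentials (e.g. one of the monitors) and that `jq` is installed there — that last part is an assumption:

```
#!/usr/bin/env bash
# Sketch: check overall cluster health from the JSON form of `ceph -s`
# instead of scraping the human-readable output.
set -euo pipefail

# --format json is a standard ceph CLI option; .health.status holds
# HEALTH_OK / HEALTH_WARN / HEALTH_ERR.
health=$(ceph -s --format json | jq -r '.health.status')
echo "cluster health: ${health}"

# Exit non-zero unless the cluster is healthy, so this can be dropped
# into cron or a monitoring agent as-is.
[ "${health}" = "HEALTH_OK" ]
```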
## OSD tree

Here's the output of `ceph osd tree`.

```
[root@kzn-mon1 ~]# ceph osd tree
ID     CLASS  WEIGHT      TYPE NAME                STATUS  REWEIGHT  PRI-AFF
-1002          72.99988   root performance
  -35           7.29999       host kzn-osd05-nvme
  272    ssd     7.29999           osd.272             up   1.00000  1.00000
  -37           7.29999       host kzn-osd06-nvme
  273    ssd     7.29999           osd.273             up   1.00000  1.00000
  -33           7.29999       host kzn-osd07-nvme
  274    ssd     7.29999           osd.274             up   1.00000  1.00000
  -39           7.29999       host kzn-osd08-nvme
  275    ssd     7.29999           osd.275             up   1.00000  1.00000
  -41           7.29999       host kzn-osd09-nvme
  276    ssd     7.29999           osd.276             up   1.00000  1.00000
  -43           7.29999       host kzn-osd10-nvme
  277    ssd     7.29999           osd.277             up   1.00000  1.00000
  -45           7.29999       host kzn-osd11-nvme
  278    ssd     7.29999           osd.278             up   1.00000  1.00000
  -47           7.29999       host kzn-osd12-nvme
  279    ssd     7.29999           osd.279             up   1.00000  1.00000
  -49           7.29999       host kzn-osd13-nvme
  280    ssd     7.29999           osd.280             up   1.00000  1.00000
  -51           7.29999       host kzn-osd14-nvme
  281    ssd     7.29999           osd.281             up   1.00000  1.00000
-1001        1069.16003   root size
   -6         106.91600       host kzn-osd05
  152    hdd     8.90999           osd.152             up   1.00000  1.00000
  153    hdd     8.90999           osd.153             up   1.00000  1.00000
  154    hdd     8.90999           osd.154             up   1.00000  1.00000
  155    hdd     8.90999           osd.155             up   1.00000  1.00000
  156    hdd     8.90999           osd.156             up   1.00000  1.00000
  157    hdd     8.90999           osd.157             up   1.00000  1.00000
  158    hdd     8.90999           osd.158             up   1.00000  1.00000
  159    hdd     8.90999           osd.159             up   1.00000  1.00000
  160    hdd     8.90999           osd.160             up   1.00000  1.00000
  161    hdd     8.90999           osd.161             up   1.00000  1.00000
  162    hdd     8.90999           osd.162             up   1.00000  1.00000
  163    hdd     8.90999           osd.163             up   1.00000  1.00000
   -7         106.91600       host kzn-osd06
  164    hdd     8.90999           osd.164             up   1.00000  1.00000
  165    hdd     8.90999           osd.165             up   1.00000  1.00000
  166    hdd     8.90999           osd.166             up   1.00000  1.00000
  167    hdd     8.90999           osd.167             up   1.00000  1.00000
  168    hdd     8.90999           osd.168             up   1.00000  1.00000
  169    hdd     8.90999           osd.169             up   1.00000  1.00000
  170    hdd     8.90999           osd.170             up   1.00000  1.00000
  171    hdd     8.90999           osd.171             up   1.00000  1.00000
  172    hdd     8.90999           osd.172             up   1.00000  1.00000
  173    hdd     8.90999           osd.173             up   1.00000  1.00000
  174    hdd     8.90999           osd.174             up   1.00000  1.00000
  175    hdd     8.90999           osd.175             up   1.00000  1.00000
   -8         106.91600       host kzn-osd07
  176    hdd     8.90999           osd.176             up   1.00000  1.00000
  177    hdd     8.90999           osd.177             up   1.00000  1.00000
  178    hdd     8.90999           osd.178             up   1.00000  1.00000
  179    hdd     8.90999           osd.179             up   1.00000  1.00000
  180    hdd     8.90999           osd.180             up   1.00000  1.00000
  181    hdd     8.90999           osd.181             up   1.00000  1.00000
  182    hdd     8.90999           osd.182             up   1.00000  1.00000
  183    hdd     8.90999           osd.183             up   1.00000  1.00000
  184    hdd     8.90999           osd.184             up   1.00000  1.00000
  185    hdd     8.90999           osd.185             up   1.00000  1.00000
  186    hdd     8.90999           osd.186             up   1.00000  1.00000
  187    hdd     8.90999           osd.187             up   1.00000  1.00000
   -9         106.91600       host kzn-osd08
  188    hdd     8.90999           osd.188             up   1.00000  1.00000
  189    hdd     8.90999           osd.189             up   1.00000  1.00000
  190    hdd     8.90999           osd.190             up   1.00000  1.00000
  191    hdd     8.90999           osd.191             up   1.00000  1.00000
  192    hdd     8.90999           osd.192             up   1.00000  1.00000
  193    hdd     8.90999           osd.193             up   1.00000  1.00000
  194    hdd     8.90999           osd.194             up   1.00000  1.00000
  195    hdd     8.90999           osd.195             up   1.00000  1.00000
  196    hdd     8.90999           osd.196             up   1.00000  1.00000
  197    hdd     8.90999           osd.197             up   1.00000  1.00000
  198    hdd     8.90999           osd.198             up   1.00000  1.00000
  199    hdd     8.90999           osd.199             up   1.00000  1.00000
  -21         106.91600       host kzn-osd09
  200    hdd     8.90999           osd.200             up   1.00000  1.00000
  201    hdd     8.90999           osd.201             up   1.00000  1.00000
  202    hdd     8.90999           osd.202             up   1.00000  1.00000
  203    hdd     8.90999           osd.203             up   1.00000  1.00000
  204    hdd     8.90999           osd.204             up   1.00000  1.00000
  205    hdd     8.90999           osd.205             up   1.00000  1.00000
  206    hdd     8.90999           osd.206             up   1.00000  1.00000
  207    hdd     8.90999           osd.207             up   1.00000  1.00000
  208    hdd     8.90999           osd.208             up   1.00000  1.00000
  209    hdd     8.90999           osd.209             up   1.00000  1.00000
  210    hdd     8.90999           osd.210             up   1.00000  1.00000
  211    hdd     8.90999           osd.211             up   1.00000  1.00000
  -23         106.91600       host kzn-osd10
  212    hdd     8.90999           osd.212             up   1.00000  1.00000
  213    hdd     8.90999           osd.213             up   1.00000  1.00000
  214    hdd     8.90999           osd.214             up   1.00000  1.00000
  215    hdd     8.90999           osd.215             up   1.00000  1.00000
  216    hdd     8.90999           osd.216             up   1.00000  1.00000
  217    hdd     8.90999           osd.217             up   1.00000  1.00000
  218    hdd     8.90999           osd.218             up   1.00000  1.00000
  219    hdd     8.90999           osd.219             up   1.00000  1.00000
  220    hdd     8.90999           osd.220             up   1.00000  1.00000
  221    hdd     8.90999           osd.221             up   1.00000  1.00000
  222    hdd     8.90999           osd.222             up   1.00000  1.00000
  223    hdd     8.90999           osd.223             up   1.00000  1.00000
  -25         106.91600       host kzn-osd11
  224    hdd     8.90999           osd.224             up   1.00000  1.00000
  225    hdd     8.90999           osd.225             up   1.00000  1.00000
  226    hdd     8.90999           osd.226             up   1.00000  1.00000
  227    hdd     8.90999           osd.227             up   1.00000  1.00000
  228    hdd     8.90999           osd.228             up   1.00000  1.00000
  229    hdd     8.90999           osd.229             up   1.00000  1.00000
  230    hdd     8.90999           osd.230             up   1.00000  1.00000
  231    hdd     8.90999           osd.231             up   1.00000  1.00000
  232    hdd     8.90999           osd.232             up   1.00000  1.00000
  233    hdd     8.90999           osd.233             up   1.00000  1.00000
  234    hdd     8.90999           osd.234             up   1.00000  1.00000
  235    hdd     8.90999           osd.235             up   1.00000  1.00000
  -27         106.91600       host kzn-osd12
  236    hdd     8.90999           osd.236             up   1.00000  1.00000
  237    hdd     8.90999           osd.237             up   1.00000  1.00000
  238    hdd     8.90999           osd.238             up   1.00000  1.00000
  239    hdd     8.90999           osd.239             up   1.00000  1.00000
  240    hdd     8.90999           osd.240             up   1.00000  1.00000
  241    hdd     8.90999           osd.241             up   1.00000  1.00000
  242    hdd     8.90999           osd.242             up   1.00000  1.00000
  243    hdd     8.90999           osd.243             up   1.00000  1.00000
  244    hdd     8.90999           osd.244             up   1.00000  1.00000
  245    hdd     8.90999           osd.245             up   1.00000  1.00000
  246    hdd     8.90999           osd.246             up   1.00000  1.00000
  247    hdd     8.90999           osd.247             up   1.00000  1.00000
  -29         106.91600       host kzn-osd13
  248    hdd     8.90999           osd.248             up   1.00000  1.00000
  249    hdd     8.90999           osd.249             up   1.00000  1.00000
  250    hdd     8.90999           osd.250             up   1.00000  1.00000
  251    hdd     8.90999           osd.251             up   1.00000  1.00000
  252    hdd     8.90999           osd.252             up   1.00000  1.00000
  253    hdd     8.90999           osd.253           down         0  1.00000
  254    hdd     8.90999           osd.254             up   1.00000  1.00000
  255    hdd     8.90999           osd.255             up   1.00000  1.00000
  256    hdd     8.90999           osd.256             up   1.00000  1.00000
  257    hdd     8.90999           osd.257             up   1.00000  1.00000
  258    hdd     8.90999           osd.258             up   1.00000  1.00000
  259    hdd     8.90999           osd.259             up   1.00000  1.00000
  -31         106.91600       host kzn-osd14
  260    hdd     8.90999           osd.260             up   1.00000  1.00000
  261    hdd     8.90999           osd.261             up   1.00000  1.00000
  262    hdd     8.90999           osd.262             up   1.00000  1.00000
  263    hdd     8.90999           osd.263             up   1.00000  1.00000
  264    hdd     8.90999           osd.264             up   1.00000  1.00000
  265    hdd     8.90999           osd.265             up   1.00000  1.00000
  266    hdd     8.90999           osd.266             up   1.00000  1.00000
  267    hdd     8.90999           osd.267             up   1.00000  1.00000
  268    hdd     8.90999           osd.268             up   1.00000  1.00000
  269    hdd     8.90999           osd.269             up   1.00000  1.00000
  270    hdd     8.90999           osd.270             up   1.00000  1.00000
  271    hdd     8.90999           osd.271             up   1.00000  1.00000
   -1               0     root default
```

The hosts run from `kzn-osd05` to `kzn-osd14`. Hosts `kzn-osd01` to `kzn-osd04` were the Fujitsu servers that used to host the performance root; those machines have since been decommissioned/repurposed.

You'll notice that every `kzn-osdX` host has a corresponding `kzn-osdX-nvme` host. This is a trick that lets us place the NVMe drives in a separate CRUSH root even though they sit in the same physical host. For this to work, the following configuration flag must be set to false; otherwise ceph would move the NVMe OSDs back under their physical hosts when they start.

```
[root@kzn-mon1 ~]# grep "osd crush update on start" /etc/ceph/ceph.conf
osd crush update on start = false
```
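Because of the two roots, each pool is steered to its drives by a CRUSH rule. You can check which rule (and therefore which root) a pool uses from a monitor. A minimal sketch — the pool names below come from the `ceph df` output in the next section, while the rule name in the last command is only an example; your output will show whatever rules are actually configured here:

```
# List the CRUSH rules that exist on the cluster.
ceph osd crush rule ls

# Show which rule a pool uses (prints "crush_rule: <name>").
ceph osd pool get ostack2 crush_rule
ceph osd pool get performance2 crush_rule

# Dump a rule by name to confirm which root it starts from
# ("size" or "performance"). The rule name here is just a placeholder.
ceph osd crush rule dump replicated_rule
```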
## Pools

Here are the current pools we have. This will, of course, change over time.

```
[root@kzn-mon1 ~]# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       1.0 PiB     861 TiB     199 TiB      199 TiB         18.78
    ssd        73 TiB      48 TiB      24 TiB       24 TiB         33.39
    TOTAL     1.1 PiB     910 TiB     223 TiB      223 TiB         19.72

POOLS:
    POOL                    ID      STORED      OBJECTS     USED        %USED     MAX AVAIL
    .rgw.root               72      1.1 KiB           7     1.1 MiB         0       239 TiB
    rbd                     92       45 GiB      11.71k     135 GiB      0.02       239 TiB
    .rgw                    112     541 KiB       2.39k     448 MiB         0       239 TiB
    .rgw.gc                 113     173 KiB          32      13 MiB         0       239 TiB
    .users.uid              114     226 KiB         278      36 MiB         0       239 TiB
    .usage                  115      65 MiB          32      65 MiB         0       239 TiB
    .rgw.buckets.index      116      45 MiB       1.21k      45 MiB         0       239 TiB
    .rgw.control            119         0 B           8         0 B         0       239 TiB
    .rgw.buckets            133     6.7 TiB       1.94M      10 TiB      1.43       477 TiB
    .log                    139        28 B         177     192 KiB         0       239 TiB
    .users                  141       121 B           3     576 KiB         0       239 TiB
    .rgw.buckets.extra      143     132 KiB           5     1.1 MiB         0       239 TiB
    m2                      144     6.6 TiB       1.74M      20 TiB      2.70       239 TiB
    ostack2                 145      42 TiB      40.56M     129 TiB     15.30       239 TiB
    performance2            146     8.1 TiB       2.12M      24 TiB     37.15        14 TiB
    cephfs                  149      13 TiB       3.72M      39 TiB      5.16       239 TiB
    cephfsmeta              150     257 MiB         158     771 MiB         0       239 TiB
    openshift4              153     532 MiB         194     1.6 GiB         0       239 TiB
    default.rgw.meta        159         0 B           0         0 B         0       239 TiB
    default.rgw.control     160         0 B           0         0 B         0       239 TiB
    default.rgw.log         161         0 B           0         0 B         0       239 TiB
    .users.email            168         0 B           0         0 B         0        32 TiB
```

I have enabled the pg autoscaler, so it automatically manages the number of placement groups for each pool.

```
[root@kzn-mon1 ~]# ceph osd pool autoscale-status
POOL                   SIZE    TARGET SIZE  RATE           RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
.users.email           0                    1.28571426868  74520G        0.0000                                 1.0   16                  on
.users.uid             225.8k               3.0            1060T         0.0000                                 1.0   16                  on
.rgw                   541.0k               3.0            1060T         0.0000                                 1.0   32                  on
.rgw.buckets           6837G                1.5            1060T         0.0094                                 1.0   64                  on
openshift4             532.3M               3.0            1060T         0.0000                                 1.0   64                  on
.rgw.gc                173.1k               3.0            1060T         0.0000                                 1.0   16                  on
performance2           8271G                3.0            74520G        0.3330                                 1.0   256                 on
default.rgw.meta       0                    3.0            1060T         0.0000                                 1.0   16                  on
.log                   28                   3.0            1060T         0.0000                                 1.0   16                  on
m2                     6778G                3.0            1060T         0.0187                                 1.0   128                 on
default.rgw.log        0                    3.0            1060T         0.0000                                 1.0   16                  on
.rgw.buckets.extra     132.4k               3.0            1060T         0.0000                                 1.0   32                  on
.rgw.buckets.index     46307k               3.0            1060T         0.0000                                 1.0   32                  on
.rgw.root              1090                 3.0            1060T         0.0000                                 1.0   16                  on
rbd                    46038M               3.0            1060T         0.0001                                 1.0   32                  on
.users                 121                  3.0            1060T         0.0000                                 1.0   16                  on
.rgw.control           0                    3.0            1060T         0.0000                                 1.0   16                  on
default.rgw.control    0                    3.0            1060T         0.0000                                 1.0   16                  on
.usage                 66180k               3.0            1060T         0.0000                                 1.0   32                  on
ostack2                42525G               3.0            1060T         0.1175                                 1.0   1024                on
cephfsmeta             256.9M               3.0            1060T         0.0000                                 1.0   16                  on
cephfs                 13303G               3.0            1060T         0.0368                                 1.0   128                 on
```
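The autoscale mode is a per-pool setting, so a pool that was created without it can be checked and switched on individually. A minimal sketch of the relevant commands, run from a monitor; `ostack2` is only used here as the example pool:

```
# Confirm the pg_autoscaler mgr module is enabled (it is a mgr module).
ceph mgr module ls | grep -i pg_autoscaler

# Check the current mode for one pool (on / warn / off).
ceph osd pool get ostack2 pg_autoscale_mode

# Turn it on for a pool that does not have it yet.
ceph osd pool set ostack2 pg_autoscale_mode on

# Re-check what the autoscaler wants to do afterwards.
ceph osd pool autoscale-status
```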
## Rados Gateways

We have 4 RGWs, set up in two pairs. The first pair, `kzn-rgw1` and `kzn-rgw2`, is used for OpenStack Swift and is therefore configured to work with keystone. These hosts have public IP addresses. The second pair, `kzn-rgw-j-01` and `kzn-rgw-j-02`, is used by the baremetal OpenShift 4.x cluster with the OCS operator. These hosts do not have public IPs since they are not required.

Both pairs of RGWs have a virtual IP that is managed by `corosync`, `pcs`, and `pacemaker` for high availability.

To check the status of the pacemaker cluster, run `pcs status` or `crm_mon`:

```
[root@kzn-rgw-j-01 ~]# pcs status
Cluster name: openshift_rgws
Stack: corosync
Current DC: kzn-rgw-j-02 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri May 14 10:41:42 2021
Last change: Fri May 14 10:41:35 2021 by hacluster via crmd on kzn-rgw-j-01

2 nodes configured
1 resource instance configured

Online: [ kzn-rgw-j-01 kzn-rgw-j-02 ]

Full list of resources:

 virtual_ip     (ocf::heartbeat:IPaddr2):       Started kzn-rgw-j-01

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
```

Note that both hosts are online and that the `virtual_ip` resource is currently running on `kzn-rgw-j-01`.
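If one of the gateways in a pair needs to go down for maintenance, the usual pacemaker workflow is to put that node in standby so the virtual IP fails over to its partner, then take it out of standby afterwards. A minimal sketch against the `openshift_rgws` pair shown above; the same approach applies to the Swift pair:

```
# Put the node in standby; pacemaker moves virtual_ip to kzn-rgw-j-02.
pcs node standby kzn-rgw-j-01

# Confirm where the resource is running now.
pcs status resources

# After maintenance, allow the node to host resources again.
pcs node unstandby kzn-rgw-j-01
```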