Ceph
Where to find docs
The official Ceph documentation is located at https://docs.ceph.com/en/latest/rados/operations/
It is strongly advised to use the documentation for the version being used.
- Pacific - https://docs.ceph.com/en/pacific/rados/operations/
- Quincy - https://docs.ceph.com/en/quincy/rados/operations/
- Reef - https://docs.ceph.com/en/reef/rados/operations/
Do not take information in the documentation at face value. Especially for advanced, rarely used, or very new features, it is strongly advised to test any claims the documentation makes about a particular feature.
Never assume that things will work as written without actually testing it on a test setup as close to your real workload scenario as possible.
Advice on Ceph releases
The current Ceph releases and their support status can be found on https://docs.ceph.com/en/latest/releases/
When a new Ceph stable version is released, you are strongly advised not to roll it out on any production cluster right away. Even though it is listed as "stable", that does not mean it actually is. Especially avoid using .0 releases on anything remotely production unless you really, really know what you're doing and can live with a possible catastrophic failure.
Be very conservative about what version you run on production systems.
Shiny new features aren't worth the risk of total or partial data loss/corruption.
General maintenance
60 seconds cluster overview
The following commands can be used to quickly check the status of Ceph:
- Print overall cluster status
ceph -s
- Print detailed health information
ceph health detail
- Display current OSD tree
ceph osd tree
- Cluster storage usage by pool and storage class
ceph df
- List pools with detailed configuration
ceph osd pool ls detail
- Get usage stats for OSDs
ceph osd df {plain|tree} {class, e.g. hdd|ssd}
- Watch Ceph health messages sequentially
ceph -w
- List daemon versions running in the cluster
ceph versions
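The checks above can be wrapped into one small script. This is only a sketch: the run() helper and the DRY_RUN switch are our own additions, and with DRY_RUN=1 (the default here) every command is printed instead of executed, so the script can be reviewed before use on a real cluster.

```shell
#!/bin/sh
# Sketch: the 60-second overview as one script.
# DRY_RUN=1 (default) only prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

overview() {
    run ceph -s
    run ceph health detail
    run ceph osd tree
    run ceph df
    run ceph osd pool ls detail
    run ceph osd df tree
    run ceph versions
}

overview
```

Set DRY_RUN=0 in the environment to actually execute the commands on a node with a working Ceph admin keyring.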
You can also run the following on each node running Ceph daemons to gather further debug information about the environment:
# lscpu
# cat /proc/cpuinfo # if lscpu isn't available
# free -g
# ip l
# ethtool <device> # for each network adapter
Mute/Unmute a health warning
$ ceph health mute <what> <duration>
$ ceph health unmute <what>
Disable/Enable (deep-)scrubbing
$ ceph osd set noscrub
$ ceph osd set nodeep-scrub
$ ceph osd unset noscrub
$ ceph osd unset nodeep-scrub
Use this sparingly and only in emergency situations. Setting these flags causes a HEALTH_WARN status, increases the risk of undetected data corruption, and can also lead to a HEALTH_WARN due to PGs not being (deep-)scrubbed in time.
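Since forgetting one of the two flags is easy, a small helper can toggle both together. This is a sketch: the scrubs() function and run()/DRY_RUN helpers are our own, and with DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: toggle both scrub flags together so that neither is forgotten.
# DRY_RUN=1 (default) only prints the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

scrubs() {
    case "$1" in
        off) run ceph osd set noscrub
             run ceph osd set nodeep-scrub ;;
        on)  run ceph osd unset noscrub
             run ceph osd unset nodeep-scrub ;;
        *)   echo "usage: scrubs on|off" >&2; return 1 ;;
    esac
}

scrubs off
scrubs on
```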
Reboot a single node
The traditional way of doing this is to set the noout flag, perform the maintenance work, and unset the flag after the node is back online:
ceph osd set noout
After maintenance is done and the host is back up:
ceph osd unset noout
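The workflow can be sketched as a pre- and post-maintenance step. The function names and the run()/DRY_RUN helpers are our own; with DRY_RUN=1 (the default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch: the noout reboot workflow split into a pre- and post-step.
# DRY_RUN=1 (default) only prints the commands.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

pre_maintenance()  { run ceph osd set noout; }    # run before rebooting the node
post_maintenance() { run ceph osd unset noout; }  # run after the node and its OSDs are back up

pre_maintenance
post_maintenance
```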
On versions Luminous or above you can set the flag individually for single OSDs or for entire CRUSH buckets, which can be a safer option for prolonged maintenance periods.
Add noout for an OSD:
ceph osd add-noout osd.<ID>
Remove noout for an OSD:
ceph osd rm-noout osd.<ID>
Add noout for a CRUSH bucket (e.g. a host name as seen in ceph osd tree):
ceph osd set-group noout <crush-bucket-name>
Remove noout for CRUSH bucket:
ceph osd unset-group noout <crush-bucket-name>
Gathering information about block devices
Enumerate typical storage devices and LVM
# lsblk
# lsblk -S
# lsscsi
# nvme list
# pvs
# vgs
# lvs
SMART data for SATA/SAS and NVMe devices
# smartctl -a /dev/sdX
# nvme smart-log /dev/nvmeXnY
Check the format of an NVMe device
# nvme id-ns -H /dev/nvmeXnY
Check the last lines labelled "LBA Format". They show which formats are supported, which format is in use, and which format offers the best performance according to the vendor.
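These lines can be filtered out with grep. The sample output below is an assumption modelled on typical nvme-cli output (exact wording varies between nvme-cli versions); on a real system you would pipe `nvme id-ns -H /dev/nvmeXnY` through the same filters.

```shell
#!/bin/sh
# Sample modelled on typical nvme-cli `id-ns -H` output; the exact
# wording differs between nvme-cli versions.
sample='LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better'

# All supported formats:
echo "$sample" | grep '^LBA Format'
# The format currently in use is the line marked "(in use)":
echo "$sample" | grep '(in use)'
```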
Format an NVMe device to a different LBA format using nvme-cli
This will destroy all data on the device!
# nvme format --lbaf=<id> /dev/nvmeXnY
Secure Erase an NVMe drive using nvme-cli
This will destroy all data on the device!
# nvme format -s2 /dev/nvmeXnY
# blkdiscard /dev/nvmeXnY
# nvme format -s1 /dev/nvmeXnY
Secure Erase a SATA/SAS drive using hdparm
This will destroy all data on the device!
- Gather device info:
# hdparm -I /dev/sdX
Check that the output says "not frozen" and "not locked". It should also list support for enhanced erase and time estimates for SECURITY ERASE UNIT and/or ENHANCED SECURITY ERASE UNIT.
- Set a master password for the disk (required; it is automatically removed after the wipe):
# hdparm --user-master u --security-set-pass wipeit /dev/sdX
# hdparm -I /dev/sdX
Check that "Security level" is now "high" and that the master password is now shown as "enabled" instead of "not enabled".
- Wipe the device.
If the device supports enhanced security erase (preferable), use the following:
# hdparm --user-master u --security-erase-enhanced wipeit /dev/sdX
If not, use the standard security erase:
# hdparm --user-master u --security-erase wipeit /dev/sdX
On some systems the firmware may "freeze" the device, which makes it impossible to issue a secure erase or to reformat it. In that case it may be necessary to either "unfreeze" the drive or to install it in another system where it can be unfrozen. Also make sure that the device was actually wiped. It is recommended to at least perform a blanking pass on HDDs with a tool like nwipe.
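The hdparm sequence can be sketched as one dry-run script. DEV and the throwaway password "wipeit" are placeholders, the run()/DRY_RUN helpers are our own, and note that hdparm's --user-master option expects "u" (user) or "m" (master). With DRY_RUN=1 (the default here) the commands are only printed, which matters for a sequence this destructive.

```shell
#!/bin/sh
# Sketch: the hdparm secure-erase sequence as one dry-run script.
# DRY_RUN=1 (default) prints the commands instead of destroying data.
DRY_RUN=${DRY_RUN:-1}
DEV=${DEV:-/dev/sdX}   # placeholder device
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

secure_erase() {
    run hdparm -I "$DEV"   # must report "not frozen" and "not locked"
    run hdparm --user-master u --security-set-pass wipeit "$DEV"
    run hdparm --user-master u --security-erase-enhanced wipeit "$DEV"
}

secure_erase
```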
OSD maintenance tasks
Locate a specific OSD in the cluster
$ ceph osd find osd.<ID>
Get OSD metadata (global and single OSD)
$ ceph osd metadata
$ ceph osd metadata osd.<ID>
Interesting fields:
- bluefs_db_rotational
- bluefs_dedicated_db
- bluefs_dedicated_wal
- bluefs_wal_rotational
- bluestore_bdev_rotational
- device_ids
- device_paths
- devices
- hostname
- osd_objectstore
- rotational
Add a new OSD
- Prepare the configuration for the new OSD first. Details on adding the configuration for a new OSD can be found in the Ceph configuration guide.
- Deploy the new OSD service on <nodename>:
osism apply ceph-osds -l <nodename> -e ceph_handler_osds_restart=false
Replace a defective OSD
Remove an OSD
Proceed as in "Remove a single OSD node", except that the steps are only executed for a single OSD and the node is not removed from the CRUSH map or the inventory. Only the entries relating to the removed OSD are removed from the host vars.
Manual way
$ ceph osd crush reweight osd.<ID> 0.0
# Wait for rebalance to complete...
$ ceph osd out osd.<ID>
# systemctl stop ceph-osd@<ID>
# systemctl disable ceph-osd@<ID>
$ ceph osd purge osd.<ID> --yes-i-really-mean-it
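The manual steps can be sketched as a single function. This is a dry-run sketch (the run()/DRY_RUN helpers are our own): with DRY_RUN=1, the default here, nothing is executed against a real cluster, and the rebalance wait between steps is left as a comment.

```shell
#!/bin/sh
# Sketch: manual removal of one OSD, dry-run by default.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

remove_osd() {
    id=$1
    run ceph osd crush reweight "osd.$id" 0.0
    # Wait here until `ceph -s` shows the rebalance has finished.
    run ceph osd out "osd.$id"
    run systemctl stop "ceph-osd@$id"
    run systemctl disable "ceph-osd@$id"
    run ceph osd purge "osd.$id" --yes-i-really-mean-it
}

remove_osd 5
```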
The LV and VG defined in the inventory for this OSD must also be removed. The OSD itself should be wiped.
Remove a single OSD node
- Get all OSDs of the node
$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.11691 root default
-3 0.03897 host testbed-node-0
0 hdd 0.01949 osd.0 up 1.00000 1.00000
4 hdd 0.01949 osd.4 up 1.00000 1.00000
-5 0.03897 host testbed-node-1
1 hdd 0.01949 osd.1 up 1.00000 1.00000
3 hdd 0.01949 osd.3 up 1.00000 1.00000
-7 0.03897 host testbed-node-2
2 hdd 0.01949 osd.2 up 1.00000 1.00000
 5 hdd 0.01949 osd.5 up 1.00000 1.00000
- Reduce the weight of all OSDs on the node to 0. Do this for each OSD in turn and wait after each adjustment until the Ceph cluster is balanced. Depending on how large the Ceph cluster and the individual OSDs are, this may take some time.
$ ceph osd crush reweight osd.2 0.0
$ ceph osd crush reweight osd.5 0.0
The Ceph OSDs that are to be removed then have a weight of 0.
$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.07794 root default
-3 0.03897 host testbed-node-0
0 hdd 0.01949 osd.0 up 1.00000 1.00000
4 hdd 0.01949 osd.4 up 1.00000 1.00000
-5 0.03897 host testbed-node-1
1 hdd 0.01949 osd.1 up 1.00000 1.00000
3 hdd 0.01949 osd.3 up 1.00000 1.00000
-7 0 host testbed-node-2
2 hdd 0 osd.2 up 1.00000 1.00000
 5 hdd 0 osd.5 up 1.00000 1.00000
- Remove the OSDs and everything that belongs to them from the node. This is a disruptive action that cannot be undone. The devices used are also reset.
$ osism apply ceph-shrink-osd -e ireallymeanit=yes -e osd_to_kill=2,5
All OSDs were removed.
$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.07794 root default
-3 0.03897 host testbed-node-0
0 hdd 0.01949 osd.0 up 1.00000 1.00000
4 hdd 0.01949 osd.4 up 1.00000 1.00000
-5 0.03897 host testbed-node-1
1 hdd 0.01949 osd.1 up 1.00000 1.00000
3 hdd 0.01949 osd.3 up 1.00000 1.00000
-7 0 host testbed-node-2
- Remove the node from the CRUSH map.
$ ceph osd crush remove testbed-node-2
removed item id -7 name 'testbed-node-2' from crush map
- Remove the node from all Ceph groups in the inventory.
- Remove all Ceph-specific parameters from the node's host vars in the inventory.
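The draining step of the procedure above can be sketched as a loop. This is a dry-run sketch (drain_osds() and the run()/DRY_RUN helpers are our own): the OSD ids are passed in explicitly, and the wait for the rebalance between OSDs is left as a comment.

```shell
#!/bin/sh
# Sketch: reweight each given OSD id to 0, one after the other.
# On a live cluster the ids of a host could be listed with:
#   ceph osd ls-tree <host>
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

drain_osds() {
    for id in "$@"; do
        run ceph osd crush reweight "osd.$id" 0.0
        # Wait for the rebalance to finish before draining the next OSD.
    done
}

drain_osds 2 5
```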
Remove an OSD (temporarily, e.g. when replacing a broken disk)
$ ceph osd out osd.<ID>
# systemctl stop ceph-osd@<ID>
# systemctl disable ceph-osd@<ID>
Disable backfills/recovery completely
Use only in emergency situations!
$ ceph osd set nobackfill
$ ceph osd set norecovery
$ ceph osd set norebalance
Unset the flags with ceph osd unset <flag>.