Ceph Cheat Sheet for Linux Sysadmins
A quick reference guide for Linux sysadmins managing Ceph clusters.
🔹 Ceph Basics
- Cluster Status:

```bash
ceph status
ceph -s    # Short form
```

- Check Ceph Health:

```bash
ceph health detail
```

- Cluster Configuration Dump:

```bash
ceph config dump
```
🔹 Monitor & Logs
- View Monitor Map:

```bash
ceph mon dump
```

- Monitor Logs:

```bash
journalctl -u ceph-mon@<host> -f
```
🔹 OSD Management
- List OSDs:

```bash
ceph osd tree
ceph osd ls
```

- OSD Usage:

```bash
ceph osd df
ceph osd utilization
```

- Mark OSD Out / In:

```bash
ceph osd out <osd_id>
ceph osd in <osd_id>
```

- Stop / Start OSD Service:

```bash
systemctl stop ceph-osd@<id>
systemctl start ceph-osd@<id>
```

- Remove OSD:

```bash
ceph osd purge <osd_id> --yes-i-really-mean-it
```
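When an OSD is being removed for good, it is safer to drain it first and confirm the removal will not put data at risk. A minimal sketch of that workflow, assuming the OSD ID is 12 (substitute your own):

```bash
# Drain the OSD so its PGs are remapped elsewhere, then wait for active+clean
ceph osd out 12
ceph -s

# Ask Ceph whether the OSD can be removed without data becoming unavailable
ceph osd safe-to-destroy osd.12

# Stop the daemon and purge the OSD from the CRUSH map, auth, and OSD map
systemctl stop ceph-osd@12
ceph osd purge 12 --yes-i-really-mean-it
```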
🔹 Pool Management
- List Pools:

```bash
ceph osd pool ls
```

- Create a Pool:

```bash
ceph osd pool create mypool 128 128 replicated
```

- Delete a Pool (deletion must be enabled on the monitors first; see the sketch after this list):

```bash
ceph osd pool delete mypool mypool --yes-i-really-really-mean-it
```

- Get Pool Stats:

```bash
ceph df
ceph osd pool stats mypool
```
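Two follow-ups worth knowing for the pool commands above: monitors refuse `pool delete` until `mon_allow_pool_delete` is enabled, and new pools usually need a replica count and an application tag. A sketch, reusing the example pool `mypool` (the `rbd` tag is only an example):

```bash
# Typical post-creation settings: replica count and application tag
ceph osd pool set mypool size 3
ceph osd pool application enable mypool rbd

# Deletion is refused by default; enable it, delete, then disable it again
ceph config set mon mon_allow_pool_delete true
ceph osd pool delete mypool mypool --yes-i-really-really-mean-it
ceph config set mon mon_allow_pool_delete false
```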
🔹 PG (Placement Groups)
- Check PGs:

```bash
ceph pg stat
ceph pg dump | less
```

- Repair PG:

```bash
ceph pg repair <pg_id>
```
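Before repairing, it helps to find out which PGs are inconsistent and which objects are affected. A quick sketch (the pool name and the PG ID `2.1f` are examples):

```bash
# PGs flagged inconsistent by scrubbing
ceph health detail | grep -i inconsistent
rados list-inconsistent-pg mypool

# Inspect the damaged objects in one PG, then trigger the repair
rados list-inconsistent-obj 2.1f --format=json-pretty
ceph pg repair 2.1f
```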
🔹 CRUSH Map
- View CRUSH Map:

```bash
ceph osd crush tree
ceph osd crush dump | less
```

- Edit CRUSH Map:

```bash
ceph osd getcrushmap -o map.bin
crushtool -d map.bin -o map.txt
# edit map.txt
crushtool -c map.txt -o newmap.bin
ceph osd setcrushmap -i newmap.bin
```
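Many routine CRUSH changes do not need the decompile/recompile cycle at all; the CLI can edit the live map directly. A sketch (bucket names, device class, and weight are illustrative):

```bash
# Replicated rule that targets only SSD-class OSDs with a host failure domain
ceph osd crush rule create-replicated ssd-rule default host ssd

# Move a host bucket under a rack bucket in the hierarchy
ceph osd crush move node03 rack=rack1

# Adjust the CRUSH weight of a single OSD (roughly its capacity in TiB)
ceph osd crush reweight osd.7 1.8
```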
🔹 RADOS & RGW
- Check RGW Users:

```bash
radosgw-admin user list
```

- Create a User:

```bash
radosgw-admin user create --uid="testuser" --display-name="Test User"
```

- Get User Key:

```bash
radosgw-admin key create --uid="testuser"
```

- Check Bucket Stats:

```bash
radosgw-admin bucket stats --bucket=mybucket
```
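To verify the new user end to end, point any S3 client at the RGW endpoint using the keys printed by `key create`. A sketch with the AWS CLI, assuming RGW listens on `http://rgw.example.com:8080` (endpoint, profile name, and keys are placeholders):

```bash
# Store the access/secret key from 'radosgw-admin key create' in a dedicated profile
aws configure set aws_access_key_id <access_key> --profile ceph
aws configure set aws_secret_access_key <secret_key> --profile ceph

# Create a bucket through the gateway and list it back
aws --profile ceph --endpoint-url http://rgw.example.com:8080 s3 mb s3://mybucket
aws --profile ceph --endpoint-url http://rgw.example.com:8080 s3 ls
```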
🔹 MDS (CephFS)
- List Filesystems:

```bash
ceph fs ls
```

- Check MDS Status:

```bash
ceph mds stat
```

- Create Filesystem:

```bash
ceph fs new cephfs cephfs_metadata cephfs_data
```
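Once the filesystem exists and an MDS is active, clients can mount it. A minimal sketch with the kernel client and the FUSE client, assuming a monitor at 10.0.0.1 and the default `admin` user (addresses, user, and secret file path are placeholders):

```bash
mkdir -p /mnt/cephfs

# Kernel client; the secret is read from a file rather than passed on the command line
mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# FUSE client; picks up /etc/ceph/ceph.conf and the keyring automatically
ceph-fuse /mnt/cephfs
```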
🔹 Debug & Troubleshooting
- Cluster Health:

```bash
ceph health detail
```

- Slow Requests:

```bash
ceph health | grep slow
```

- Check for Scrubbing / Recovery:

```bash
ceph status | egrep "recovery|scrub"
```

- Detailed Logs:

```bash
ceph -w    # Watch cluster events
```
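When slow requests show up, the admin socket on the affected OSD shows what is actually blocked. A sketch, assuming the slow ops are reported by osd.3 and you are on the node that hosts it:

```bash
# Which daemons are reporting slow ops?
ceph health detail | grep -i slow

# In-flight and recently completed (historic) operations on the suspect OSD
ceph daemon osd.3 dump_ops_in_flight
ceph daemon osd.3 dump_historic_ops
```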
🔹 Quick Recovery Scenarios
🟢 MON Down
- Check quorum:

```bash
ceph quorum_status | jq '.quorum_names'
```

- Restart MON service:

```bash
systemctl restart ceph-mon@<host>
```

- If still failing, remove and re-add MON:

```bash
ceph mon remove <mon_id>
ceph mon add <mon_id> <ip:port>
```
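If the monitor process is running but not joining quorum, its local admin socket still answers even when cluster-wide commands hang. A sketch, assuming the monitor is named mon.node01:

```bash
# The daemon's own view of its state, rank, and quorum (via the local admin socket)
ceph daemon mon.node01 mon_status

# Clock skew between monitors is a common cause of lost quorum
ceph time-sync-status
```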
🟡 OSD Down / Lost
- Restart OSD:

```bash
systemctl restart ceph-osd@<id>
```

- Mark OSD in if safe:

```bash
ceph osd in <osd_id>
```

- If permanently failed:

```bash
ceph osd purge <osd_id> --yes-i-really-mean-it
```
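Before declaring an OSD permanently lost, check why it went down; the unit log and the crash list usually tell you. A quick sketch for osd.5 (example ID):

```bash
# Which OSDs are down, and where do they sit in the CRUSH tree?
ceph osd tree down

# Recent log of the failed daemon and any crashes Ceph has recorded
journalctl -u ceph-osd@5 --since "1 hour ago"
ceph crash ls
```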
🟠 PGs Stuck in peering / degraded
- Check PGs:

```bash
ceph pg dump | grep <pg_id>
```

- Force PG repair:

```bash
ceph pg repair <pg_id>
```

- Kick stuck OSD:

```bash
ceph osd out <osd_id>
```
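To see why a PG is stuck rather than just that it is, query it directly (the PG ID `2.1f` is an example):

```bash
# Summaries of PGs stuck in the usual bad states
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean

# Full state of one PG, including which OSDs it is waiting for
ceph pg 2.1f query | less
```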
🔴 Full Cluster / No Free Space
- Check space:

```bash
ceph df
ceph osd df
```

- Add new OSD(s).
- Set nearfull / full ratio to safe values:

```bash
ceph osd set-nearfull-ratio 0.85
ceph osd set-full-ratio 0.95
```
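It also helps to confirm the ratios currently in effect and to see whether a reweight would buy time while new capacity arrives. A sketch (the threshold 120, i.e. 120% of average utilization, is an example):

```bash
# Ratios currently in effect
ceph osd dump | grep -i ratio

# Dry-run a utilization-based reweight, then apply it if the proposed changes look sane
ceph osd test-reweight-by-utilization 120
ceph osd reweight-by-utilization 120
```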
🔧 Stuck Recovery / Backfill
- Pause recovery temporarily:

```bash
ceph osd set norecover
```

- Resume after troubleshooting:

```bash
ceph osd unset norecover
```
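If recovery is merely saturating client IO rather than stuck, throttling is usually better than pausing it outright. A sketch using runtime configuration (the values are illustrative; defaults differ between releases):

```bash
# Reduce concurrent backfill and recovery work per OSD
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

# Remove the overrides once the cluster has settled
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active
```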
🔹 Tuning Flags Cheat Sheet
| Flag | Command Example | Use Case |
|---|---|---|
| `noout` | `ceph osd set noout` | Prevents OSDs marked down from being automatically marked out. |
| `nobackfill` | `ceph osd set nobackfill` | Prevents backfilling to avoid heavy IO load during upgrades or testing. |
| `norebalance` | `ceph osd set norebalance` | Stops automatic data balancing. |
| `norecover` | `ceph osd set norecover` | Disables recovery processes. |
| `pause` | `ceph osd set pause` | Pauses client IO cluster-wide (use very carefully!). |
| `noscrub` | `ceph osd set noscrub` | Disables scrubbing temporarily. |
| `nodeep-scrub` | `ceph osd set nodeep-scrub` | Disables deep scrubbing temporarily. |
Always unset flags when done, e.g.:

```bash
ceph osd unset noout
```
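A typical use of these flags is a planned host reboot: tell Ceph not to rebalance while the node is briefly away. A sketch:

```bash
# Before taking a node down for maintenance
ceph osd set noout
ceph osd set norebalance

# ... reboot / patch the node and wait for its OSDs to rejoin ...

# Afterwards, clear the flags and confirm health
ceph osd unset norebalance
ceph osd unset noout
ceph -s
```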
🔹 Legend (Ceph Terminology)
- MON (Monitor) – Maintains cluster maps, ensures quorum.
- OSD (Object Storage Daemon) – Stores data, handles replication, recovery, backfill, rebalancing. One OSD = one disk.
- PG (Placement Group) – Logical grouping of objects across OSDs. Helps map data to OSDs.
- CRUSH Map – Algorithm/map that decides data placement across OSDs.
- MDS (Metadata Server) – Manages metadata for CephFS (directories, file ownership, permissions).
- RADOS – Reliable Autonomic Distributed Object Store (core of Ceph).
- RGW (RADOS Gateway) – S3/Swift-compatible object storage gateway.
- Scrubbing – Consistency check between objects and replicas (light = metadata only, deep = full data).
- Backfill – Process of redistributing data to OSDs when new OSDs are added or after recovery.
- Recovery – Process of replicating data when an OSD fails or comes back online.
🔹 Best Practices
- Monitor daily:

```bash
ceph status
ceph health detail
```

- Always check PG health after adding/removing OSDs.
- Use `ceph osd df` before expansion to balance data distribution.
- Regularly back up (a minimal sketch follows this list):
  - `ceph.conf`
  - Keyrings
  - Monitor DB
- Schedule scrubbing and monitor for stuck PGs.
- Avoid pools with <64 PGs in production.
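A minimal sketch of the backup item above, assuming the default /etc/ceph paths and a monitor node; the destination directory is hypothetical, and the monitor store itself (under /var/lib/ceph/mon) needs its own procedure and is not covered here:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical destination - point this at real backup storage
DEST=/backup/ceph/$(hostname)-$(date +%F)
mkdir -p "$DEST"

# Cluster configuration and keyrings from their default locations
cp /etc/ceph/ceph.conf "$DEST"/
cp /etc/ceph/*.keyring "$DEST"/

# Point-in-time copies of the monitor map and auth database (contains keys - protect it)
ceph mon getmap -o "$DEST/monmap.bin"
ceph auth ls > "$DEST/auth.txt"
chmod 600 "$DEST/auth.txt"
```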