Ceph Cheat Sheet for Linux Sysadmins

A quick reference guide for Linux sysadmins managing Ceph clusters.


🔹 Ceph Basics

  • Cluster Status:
    ceph status
    ceph -s              # Short form
    
  • Check Ceph Health:
    ceph health detail
    
  • Cluster Configuration Dump:
    ceph config dump
    

🔹 Monitor & Logs

  • View Monitor Map:
    ceph mon dump
    
  • Monitor Logs:
    journalctl -u ceph-mon@<host> -f
    

🔹 OSD Management

  • List OSDs:
    ceph osd tree
    ceph osd ls
    
  • OSD Usage:
    ceph osd df
    ceph osd utilization
    
  • Mark OSD Out / In:
    ceph osd out <osd_id>
    ceph osd in <osd_id>
    
  • Stop / Start OSD Service:
    systemctl stop ceph-osd@<id>
    systemctl start ceph-osd@<id>
    
  • Remove OSD:
    ceph osd purge <osd_id> --yes-i-really-mean-it
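
  • Check Removal Safety (an optional check before purging; these subcommands exist on recent Ceph releases, and <osd_id> is a placeholder):
    ceph osd ok-to-stop <osd_id>          # can this OSD stop without reducing availability?
    ceph osd safe-to-destroy <osd_id>     # can this OSD be purged without risking data?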
    

🔹 Pool Management

  • List Pools:
    ceph osd pool ls
    
  • Create a Pool (see also the follow-up example below):
    ceph osd pool create mypool 128 128 replicated   # <pool> <pg_num> <pgp_num> <type>
    
  • Delete a Pool:
    ceph osd pool delete mypool mypool --yes-i-really-really-mean-it   # requires mon_allow_pool_delete=true
    
  • Get Pool Stats:
    ceph df
    ceph osd pool stats mypool
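
  • Adjust a New Pool (a follow-up sketch; the pool name and the rbd application are examples only):
    ceph osd pool set mypool size 3                  # replica count
    ceph osd pool application enable mypool rbd      # tag the pool with its intended application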
    

🔹 PG (Placement Groups)

  • Check PGs:
    ceph pg stat
    ceph pg dump | less
    
  • Repair PG:
    ceph pg repair <pg_id>
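
  • Find Problem PGs (a sketch; <pg_id> is a placeholder, and list-inconsistent-obj needs a recent scrub of that PG):
    ceph pg dump_stuck                                          # PGs stuck inactive/unclean/stale
    rados list-inconsistent-obj <pg_id> --format=json-pretty    # objects behind an inconsistent PG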
    

🔹 CRUSH Map

  • View CRUSH Map:
    ceph osd crush tree
    ceph osd crush dump | less
    
  • Edit CRUSH Map:
    ceph osd getcrushmap -o map.bin
    crushtool -d map.bin -o map.txt
    # edit map.txt
    crushtool -c map.txt -o newmap.bin
    ceph osd setcrushmap -i newmap.bin
    

🔹 RADOS & RGW

  • Check RGW Users:
    radosgw-admin user list
    
  • Create a User:
    radosgw-admin user create --uid="testuser" --display-name="Test User"
    
  • Get User Key:
    radosgw-admin key create --uid="testuser" --key-type=s3 --gen-access-key --gen-secret
    
  • Check Bucket Stats:
    radosgw-admin bucket stats --bucket=mybucket
    

🔹 MDS (CephFS)

  • List Filesystems:
    ceph fs ls
    
  • Check MDS Status:
    ceph mds stat
    
  • Create Filesystem:
    ceph fs new cephfs cephfs_metadata cephfs_data
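
  • Create the Backing Pools First (a sketch; pool names and PG counts are examples, and ceph fs new expects both pools to already exist):
    ceph osd pool create cephfs_metadata 32
    ceph osd pool create cephfs_data 64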
    

🔹 Debug & Troubleshooting

  • Cluster Health:
    ceph health detail
    
  • Slow Requests (see also the drill-down example below):
    ceph health | grep slow
    
  • Check for Scrubbing / Recovery:
    ceph status | grep -E "recovery|scrub"
    
  • Detailed Logs:
    ceph -w       # Watch cluster events
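
  • Drill Into Slow Ops on One OSD (run on the host where that OSD lives; <id> is a placeholder):
    ceph daemon osd.<id> dump_ops_in_flight     # operations currently in flight
    ceph daemon osd.<id> dump_historic_ops      # recent completed ops, useful for spotting slow ones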
    

🔹 Quick Recovery Scenarios

🟢 MON Down

  • Check quorum:
    ceph quorum_status | jq '.quorum_names'
    
  • Restart MON service:
    systemctl restart ceph-mon@<host>
    
  • If still failing, remove and re-add MON:
    ceph mon remove <mon_id>
    ceph mon add <mon_id> <ip:port>
    

🟡 OSD Down / Lost

  • Restart OSD:
    systemctl restart ceph-osd@<id>
    
  • Mark OSD in if safe:
    ceph osd in <osd_id>
    
  • If permanently failed:
    ceph osd purge <osd_id> --yes-i-really-mean-it
    

🟠 PGs Stuck in peering / degraded

  • Check PGs:
    ceph pg dump | grep <pg_id>
    
  • Force PG repair:
    ceph pg repair <pg_id>
    
  • Kick stuck OSD:
    ceph osd out <osd_id>
    

🔴 Full Cluster / No Free Space

  • Check space:
    ceph df
    ceph osd df
    
  • Add new OSD(s) (see the example after this list).
  • Set nearfull / full ratio to safe values:
    ceph osd set-nearfull-ratio 0.85
    ceph osd set-full-ratio 0.95
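
  • Example: adding an OSD (deployment-dependent; host and device names are placeholders):
    ceph orch daemon add osd <host>:/dev/<device>    # cephadm-managed clusters
    ceph-volume lvm create --data /dev/<device>      # non-cephadm deployments, run on the OSD host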
    

🔧 Stuck Recovery / Backfill

  • Pause recovery temporarily:
    ceph osd set norecover
    
  • Resume after troubleshooting:
    ceph osd unset norecover
    

🔹 Tuning Flags Cheat Sheet

Flag         | Command Example           | Use Case
noout        | ceph osd set noout        | Prevents OSDs marked down from being automatically marked out.
nobackfill   | ceph osd set nobackfill   | Prevents backfilling to avoid heavy IO load during upgrades or testing.
norebalance  | ceph osd set norebalance  | Stops automatic data balancing.
norecover    | ceph osd set norecover    | Disables recovery processes.
pause        | ceph osd set pause        | Pauses client IO cluster-wide (use very carefully!).
noscrub      | ceph osd set noscrub      | Disables scrubbing temporarily.
nodeep-scrub | ceph osd set nodeep-scrub | Disables deep scrubbing temporarily.

Always unset flags when done, e.g.:

ceph osd unset noout
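
A typical single-node maintenance window looks roughly like this (a sketch; ceph-osd.target assumes systemd-managed OSD services on the node being serviced):

ceph osd set noout                  # keep down OSDs from being marked out
ceph osd set norebalance            # avoid data movement while the node is offline
systemctl stop ceph-osd.target      # on the node under maintenance
# ... perform maintenance / reboot ...
systemctl start ceph-osd.target
ceph osd unset norebalance
ceph osd unset noout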

🔹 Legend (Ceph Terminology)

  • MON (Monitor) – Maintains cluster maps, ensures quorum.
  • OSD (Object Storage Daemon) – Stores data, handles replication, recovery, backfill, rebalancing. One OSD = one disk.
  • PG (Placement Group) – Logical grouping of objects across OSDs. Helps map data to OSDs.
  • CRUSH Map – Algorithm/map that decides data placement across OSDs.
  • MDS (Metadata Server) – Manages metadata for CephFS (directories, file ownership, permissions).
  • RADOS – Reliable Autonomic Distributed Object Store (core of Ceph).
  • RGW (RADOS Gateway) – S3/Swift-compatible object storage gateway.
  • Scrubbing – Consistency check between objects and replicas (light = metadata only, deep = full data).
  • Backfill – Process of redistributing data to OSDs when new OSDs are added or after recovery.
  • Recovery – Process of replicating data when an OSD fails or comes back online.

🔹 Best Practices

  • Monitor daily:
    ceph status
    ceph health detail
    
  • Always check PG health after adding/removing OSDs.
  • Run ceph osd df before expansion to check how data is currently distributed.
  • Regularly backup (see the sketch after this list):
    • ceph.conf
    • Keyrings
    • Monitor DB
  • Schedule scrubbing and monitor for stuck PGs.
  • Avoid pools with <64 PGs in production.
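
  • Backup sketch (the /backup/ destination and paths assume a default package-based install; stop a monitor only if quorum will still hold):
    cp /etc/ceph/ceph.conf /backup/
    ceph auth export client.admin -o /backup/ceph.client.admin.keyring   # repeat for other keyrings you rely on
    systemctl stop ceph-mon@<host>                          # stop ONE monitor while the others keep quorum...
    tar czf /backup/mon-db-<host>.tgz /var/lib/ceph/mon/    # ...and archive its store
    systemctl start ceph-mon@<host>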