CEPH-5: basic concept and management of ceph cluster


Basic concepts of a ceph cluster

  1. Overall structure of a ceph cluster

    • osd — Object Storage Device. Its main functions are storing, replicating, rebalancing, and recovering data. OSDs exchange heartbeats with one another and report state changes to the Ceph Monitor.
    • mon — Monitor. Monitors the Ceph cluster, maintains its health status, and maintains the various maps in the cluster, such as the OSD Map, Monitor Map, PG Map, and CRUSH Map. These maps are collectively called the Cluster Map, and the final storage location of data is computed from them together with the object id.
    • mgr — Manager. Tracks runtime metrics and the current state of the Ceph cluster, including storage utilization, current performance metrics, and system load.
    • mds — MetaData Server. Stores the metadata of the CephFS file system service; it is only needed when CephFS is used. Object storage and block storage do not use this service.
    • rgw — radosgw. A gateway based on the RESTful protocol with an embedded civetweb service; it is the entry point for Ceph object storage. It does not need to be installed if object storage is not enabled.
  1. ceph configuration file

    Standard location: /etc/ceph/ceph.conf


    ## Global configuration
    fsid = 537175bb-51de-4cc4-9ee3-b5ba8842bff2
    public_network =
    cluster_network =
    mon_initial_members = ceph-node1
    mon_host = 10.153.204.xx:6789,10.130.22.xx:6789,10.153.204.xx:6789
    auth_cluster_required = cephx
    auth_service_required = cephx
    auth_client_required = cephx
    ## OSD-specific configuration; a section named [osd.N] applies to the OSD with id N
    ## Monitor-specific configuration; a section named [mon.a] applies to the specific monitor, where "a" is the node name, which can be viewed with `ceph mon dump`
    ## Client specific configuration

    Loading order of ceph configuration file:

    • $CEPH_CONF environment variable
    • The path given with -c on the command line
    • /etc/ceph/ceph.conf
    • ~/.ceph/ceph.conf
    • ./ceph.conf
  2. Storage pool type

    • Replica pool: replicated
      • Defines how many copies of each object are saved in the cluster. The default is three copies: one primary and two replicas.
      • Used for high availability; the replicated pool is Ceph's default storage pool type.
    • erasure code pool
      • Each object is stored as N = K + M chunks, where K is the number of data chunks and M is the number of coding chunks; an object therefore occupies K + M chunks in the pool.
      • The data is split into K data chunks, and M redundant coding chunks provide high availability: up to M chunks can be lost, while the actual disk usage is K + M chunks, so erasure coding saves storage compared with the replica mechanism. A common scheme is 8+4, i.e. 8 data chunks + 4 coding chunks: of the 12 chunks, 8 hold data and 4 provide redundancy, so 1/3 of the disk space is used for redundancy, which is far less than the 3x overhead of the default replicated pool; the trade-off is that no more than M chunks may fail.
      • Not all applications support erasure-coded pools: RBD only supports replicated pools, while radosgw supports erasure-coded pools.
      • Because of its read/write performance cost, Ceph does not recommend erasure-coded pools for file system or block storage.
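As a back-of-the-envelope comparison of the two pool types (plain arithmetic, not Ceph code):

```python
# Compare raw-capacity overhead of an N-way replicated pool with a K+M
# erasure-coded pool, as described above.

def replica_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data with N-way replication."""
    return float(copies)

def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data with a K+M erasure code."""
    return (k + m) / k

# 3-replica pool: 3x raw usage, tolerates the loss of 2 copies.
assert replica_overhead(3) == 3.0
# 8+4 erasure-coded pool: 1.5x raw usage, tolerates the loss of up to M=4 chunks.
assert ec_overhead(8, 4) == 1.5
```

The 8+4 scheme spends 4 of 12 chunks (1/3 of the raw space) on redundancy, versus 2 of 3 copies (2/3) for the default replicated pool.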

    To check which CRUSH rule (and thus which type) a storage pool uses:

    $ ceph osd pool get test crush_rule
    crush_rule: erasure-code
  3. Replica pool IO

    • Store a data object as multiple copies.
    • During a client write, Ceph uses the CRUSH algorithm to compute the PG ID and primary OSD for the object. The primary OSD then determines the PG's secondary OSDs from the configured replica count, the object name, the pool name, and the cluster map, and replicates the data to those secondary OSDs.

    Read and write data:

    ## Read path
    1. The client sends a read request; RADOS forwards it to the primary OSD.
    2. The primary OSD reads the data from its local disk and returns it, completing the read request.
    ## Write path
    1. The client app requests a write; RADOS sends the data to the primary OSD.
    2. The primary OSD writes the data locally and forwards it to each replica OSD.
    3. Each replica OSD writes the data and sends a completion signal to the primary OSD, which then signals completion to the client app.
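The write path above can be modeled as a toy sketch (illustrative Python, not Ceph's implementation): the primary OSD writes locally, forwards the object to each replica, and acknowledges the client only after every replica has confirmed its write.

```python
# Toy model of the replicated write path: one primary OSD plus replica OSDs.

class OSD:
    def __init__(self, osd_id: int):
        self.osd_id = osd_id
        self.store = {}

    def write(self, obj: str, data: bytes) -> bool:
        self.store[obj] = data
        return True  # completion signal back to the primary

def replicated_write(primary: OSD, replicas: list, obj: str, data: bytes) -> bool:
    primary.write(obj, data)                       # 1. primary writes locally
    acks = [r.write(obj, data) for r in replicas]  # 2. forward to each replica OSD
    return all(acks)                               # 3. ack the client once all replicas confirm

osds = [OSD(i) for i in range(3)]
assert replicated_write(osds[0], osds[1:], "obj1", b"hello")
assert all(o.store["obj1"] == b"hello" for o in osds)  # all three copies identical
```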
  4. Erasure code pool IO

    • Ceph has supported erasure coding since the Firefly release, but erasure-coded pools are not recommended in production environments.
    • An erasure-coded pool reduces the total disk space needed for data storage, but reading and writing data costs more computation than in a replicated pool. RGW supports erasure-coded pools; RBD does not.

    Read and write data:

    ## Read path
    1. Fetch the data chunks from the corresponding OSDs and decode them.
    2. If chunks are lost, Ceph automatically reads the coding chunks from other OSDs and reconstructs the data.
    3. Return the data once decoding completes.
    ## Write path
    1. The data is encoded on the primary OSD and distributed to the appropriate OSDs.
    2. The data is split into data chunks and the coding chunks are computed.
    3. Each encoded chunk is written to its corresponding OSD.
  5. PG and PGP

    • PG = Placement Group
    • PGP = Placement Group for Placement purpose. PGP defines the permutation/combination relationship between PGs and OSDs.

    A placement group is an internal data structure through which each storage pool stores data across multiple OSDs. PGs are an intermediate layer between the OSD daemons and Ceph clients. A hash algorithm dynamically maps each object to a PG, its primary PG; according to the pool's replica count (for example 3), each primary PG is then replicated twice more, and the CRUSH algorithm dynamically maps the three PG instances onto three different OSD daemons. These three PG instances form a PGP, which is how multi-copy high availability across OSDs is achieved.
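The two-step mapping just described (hash the object to a PG, then map the PG to an OSD set) can be sketched as follows. This is an illustrative stand-in only: real Ceph uses its own rjenkins hash and the CRUSH algorithm, not MD5 rankings.

```python
import hashlib

def object_to_pg(pool_id: int, obj_name: str, pg_num: int) -> str:
    """Step 1: hash the object name to a PG within the pool."""
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return f"{pool_id}.{h % pg_num:x}"   # PG ids like "2.1a", as in `ceph pg ls` output

def pg_to_osds(pg_id: str, osds: list, size: int) -> list:
    """Step 2 (a stand-in for CRUSH): deterministically pick `size` distinct OSDs."""
    ranked = sorted(osds, key=lambda o: hashlib.md5(f"{pg_id}:{o}".encode()).hexdigest())
    return ranked[:size]                 # ranked[0] plays the role of the primary OSD

pg = object_to_pg(pool_id=2, obj_name="myobject", pg_num=32)
acting = pg_to_osds(pg, osds=[0, 1, 2, 3, 4], size=3)
assert pg.startswith("2.")
assert len(set(acting)) == 3             # three distinct OSDs: one primary, two replicas
```

The key property shared with the real algorithms is determinism: any client can recompute the same PG and OSD set from the object name and the cluster map, with no central lookup table.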

    The file addressing process is roughly shown in the following figure:


Several points needing attention:

  • The number of PGs and PGPs can be customized and is set per storage pool, but the total number of PGs should be determined by the size of the OSD cluster.
  • Relative to the storage pool, a PG is a virtual component: the virtual layer used when objects are mapped into the pool.
  • For scale and performance, Ceph subdivides a storage pool into multiple PGPs. Each PGP has a primary PG, and the OSD hosting the primary PG is the primary OSD.
  • When a new OSD joins the cluster, Ceph recombines the PGPs through CRUSH so that every OSD holds data, rebalancing the whole cluster.

Allocation calculation of PG:

  • Official suggestion: each OSD should host no more than 100 PGs. The formula is: total PGs = (total_number_of_osd * 100) / max_replication_count

  • Worked example: suppose there are 12 OSDs and 20 storage pools need to be created.

    Total PG count: 12 * 100 / 3 = 400

    Average PGs per storage pool: 400 / 20 = 20

    On average each pool can be given 20 PGs, but the PG count of a pool is recommended to be a power of 2 (2, 4, 8, 16, 32, 64, 128, ...), so do a simple adjustment based on what the pool will store and how much data it will hold: a pool that only stores some metadata can be given 4, while a pool holding a large amount of data can be given 16, 32, or more.

    A PG count that is not a power of 2 also works, but the cluster will issue a warning.
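The sizing rule above, expressed as a small script (a rule of thumb, not an official Ceph tool; `nearest_pow2` is a helper invented here for illustration):

```python
# PG sizing rule of thumb: total PGs = (OSD count * target PGs per OSD) / replica
# count, then round each pool's share to a power of two.

def total_pgs(num_osds: int, target_per_osd: int, replica_count: int) -> int:
    return num_osds * target_per_osd // replica_count

def nearest_pow2(n: int) -> int:
    """Round n to the nearest power of two (ties go down)."""
    lower = 1 << (n.bit_length() - 1)
    upper = lower * 2
    return lower if n - lower <= upper - n else upper

total = total_pgs(num_osds=12, target_per_osd=100, replica_count=3)
assert total == 400                     # matches the worked example above
per_pool = total // 20                  # 20 pools planned
assert per_pool == 20
assert nearest_pow2(per_pool) == 16     # 20 is closer to 16 than to 32
```

In practice the per-pool value is then skewed up or down (4 for metadata-only pools, 32+ for data-heavy ones) rather than applied uniformly.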

Combination of PG and PGP:

  1. View the number of pg and pgp in the replicapool pool

    $ ceph osd pool get replicapool pg_num 
    pg_num: 32
    $ ceph osd pool get replicapool pgp_num 
    pgp_num: 32

    View the pg and pgp distribution of replicapool pool

    $ ceph pg ls-by-pool replicapool | awk '{print $1,$2,$15}'
    2.0 596 [3,1,0]p3
    2.1 623 [3,4,0]p3
    2.2 570 [3,4,0]p3
    2.3 560 [3,4,0]p3
    2.4 630 [0,3,4]p0
    2.5 574 [4,0,3]p4
    2.6 572 [4,3,0]p4
    2.7 572 [3,4,0]p3
    2.8 622 [3,4,0]p3
    2.9 555 [0,3,4]p0
    2.a 523 [1,3,0]p1
    2.b 574 [4,3,0]p4
    2.c 620 [4,3,0]p4
    2.d 637 [1,3,0]p1
    2.e 522 [0,3,4]p0
    2.f 599 [4,3,0]p4
    2.10 645 [4,3,0]p4
    2.11 534 [3,4,0]p3
    2.12 622 [4,3,0]p4
    2.13 577 [1,3,0]p1
    2.14 661 [3,4,0]p3
    2.15 626 [1,3,0]p1
    2.16 585 [2,4,0]p2
    2.17 610 [3,4,0]p3
    2.18 610 [4,2,0]p4
    2.19 560 [4,3,0]p4
    2.1a 599 [3,4,0]p3
    2.1b 614 [1,2,0]p1
    2.1c 581 [4,3,0]p4
    2.1d 614 [4,3,0]p4
    2.1e 595 [0,3,1]p0
    2.1f 572 [3,4,0]p3
    * NOTE: the columns above are the PG id, the object count, and the acting OSD set; the pN suffix marks the primary OSD (e.g. [3,4,0]p3 means OSDs 3, 4, and 0 hold the PG, with OSD 3 as primary).

PG status explanation:

When OSDs are added or removed, or in certain other situations, Ceph rebalances data at PG granularity. During this, PGs pass through many different states, for example:

$ ceph -s 
  cluster:
    id:     537175bb-51de-4cc4-9ee3-b5ba8842bff2
    health: HEALTH_WARN
            Degraded data redundancy: 152/813 objects degraded (18.696%), 43 pgs degraded, 141 pgs undersized
  services:
    mon: 2 daemons, quorum yq01-aip-aikefu10,bjkjy-feed-superpage-gpu-04 (age 111s)
    mgr: yq01-aip-aikefu10(active, since 11d), standbys: bjkjy-feed-superpage-gpu-04
    mds: mycephfs:1 {0=ceph-node2=up:active} 1 up:standby
    osd: 8 osds: 8 up (since 3d), 8 in (since 3d); 124 remapped pgs
    rgw: 2 daemons active (ceph-node1, ceph-node2)
  task status:
  data:
    pools:   8 pools, 265 pgs
    objects: 271 objects, 14 MiB
    usage:   8.1 GiB used, 792 GiB / 800 GiB avail
    pgs:     152/813 objects degraded (18.696%)
             114/813 objects misplaced (14.022%)
             111 active+clean+remapped
             98  active+undersized
             43  active+undersized+degraded
             13  active+clean
  • Clean: the PG currently has no objects to repair and its size equals the pool's replica count, i.e. the PG's acting set and up set are the same group of OSDs with identical content.
  • Active: the primary OSD and replica OSDs are in a normal working state, and the PG can serve client read/write requests normally. A healthy PG is in the active+clean state by default.
  • Peering: the OSDs in the same PG are synchronizing state with one another; peering is the state during this negotiation process.
  • Activating: peering has completed, and the PG is waiting for all PG instances to synchronize the peering results (Info, Log, etc.).
  • Degraded: after an OSD is marked down, the PGs mapped to it switch to the degraded state. If the OSD stays down for more than 5 minutes without being repaired, Ceph starts recovery on the degraded PGs until all PGs degraded by that OSD return to the clean state.
  • Undersized: the PG's current replica count is less than the value defined by its pool; the PG remains undersized until replica OSDs are added or repaired.
  • Remapped: when the PG changes and data is migrated from old OSDs to new OSDs, the new primary OSD needs some time before it can serve requests; during this period the old primary OSD continues to serve requests until the PG migration completes.
  • Scrubbing: scrub is Ceph's mechanism for checking data integrity. OSDs periodically start scrub threads that scan objects and compare them with their replicas, mainly checking metadata such as file name, object attributes, and size; if a difference is found, a copy is made from the primary PG.
  • Stale: under normal conditions, each OSD periodically reports the latest statistics of all primary PGs it holds to the monitors in the RADOS cluster. If an OSD cannot report to the monitors for any reason, or other OSDs report it as down, all PGs whose primary is on that OSD are immediately marked stale, meaning their primary OSD does not hold the latest data.
  • Recovering: the cluster is migrating or synchronizing objects and their replicas. This may happen because a new OSD joined the cluster, or because after an OSD went down the PG was reassigned to different OSDs by the CRUSH algorithm, triggering internal data synchronization within the PG.
  • Backfilling: backfill is a special case of recovery. After peering completes, if some PG instances in the up set cannot be incrementally synchronized from the current authoritative log (for example, the OSDs hosting them were offline too long, or a newly added OSD caused whole-PG migration), full synchronization is performed by copying all objects from the current primary; the PG is in the backfilling state during this process.
  • Backfill-toofull: the backfill process is suspended because an OSD hosting a PG instance that needs backfilling does not have enough free space.
  • Creating: the PG is being created, which usually happens when a new pool is created.
  • Incomplete: during peering, no authoritative log could be chosen, or the acting set selected by choose_acting is not sufficient to complete data recovery (e.g. for erasure coding, the number of surviving chunks is less than K), so peering cannot complete normally. In other words, PG metadata is lost and the PG state cannot be restored. (ceph-objectstore-tool can adjust a PG in this state back to complete.)
  1. noscrub and nodeep-scrub

    • noscrub: disables scrub, the lightweight scan that mainly checks whether metadata is consistent and synchronizes it when it is not; scrub normally runs about once a day and is enabled by default.
    • nodeep-scrub: disables deep scrub, the full scan of all data including metadata and objects; deep scrub normally runs about once a week and is enabled by default.

    Data verification increases read pressure, and when inconsistent data is found the synchronizing writes increase write pressure as well. Therefore, during operations such as capacity expansion, noscrub and nodeep-scrub are often set manually to suspend data verification. Check whether scrubbing is disabled for a pool:

    $ ceph osd pool get replicapool noscrub
    noscrub: false
    $ ceph osd pool get replicapool nodeep-scrub
    nodeep-scrub: false
  2. data compression

    With the BlueStore storage engine, Ceph supports inline data compression: data is compressed as it is saved, which helps save disk space. Compression can be enabled or disabled on each storage pool created on BlueStore OSDs. It is disabled by default and must be configured and enabled explicitly:

    ## Enable compression
    $ ceph osd pool set <pool name> compression_algorithm <algorithm>
    Algorithms:
      snappy: default algorithm, low CPU usage
      zstd: good compression ratio, but high CPU usage
      lz4: low CPU usage
      zlib: not recommended
    $ ceph osd pool set replicapool compression_algorithm snappy
    set pool 2 compression_algorithm to snappy
    ## Set the compression mode
    $ ceph osd pool set <pool name> compression_mode <mode>
    Modes:
      none: never compress data (default)
      passive: do not compress data unless the write operation carries a compressible hint
      aggressive: compress data unless the write operation carries an incompressible hint
      force: try to compress data in all cases, even if the client hints that it is incompressible
    $ ceph osd pool set replicapool compression_mode passive
    set pool 2 compression_mode to passive

    Global compression options can be set in ceph.conf to apply to all storage pools:

    bluestore_compression_algorithm         # compression algorithm
    bluestore_compression_mode              # compression mode
    bluestore_compression_required_ratio    # required ratio of compressed size to original size, default .875
    bluestore_compression_min_blob_size     # blobs smaller than this are not compressed, default 0
    bluestore_compression_max_blob_size     # blobs larger than this are split into smaller blobs before compression, default 0
    bluestore_compression_min_blob_size_ssd # default 8K
    bluestore_compression_max_blob_size_ssd # default 64K
    bluestore_compression_min_blob_size_hdd # default 128K
    bluestore_compression_max_blob_size_hdd # default 512K

    Enabling compression increases CPU utilization; it is not recommended in production environments.
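The effect of `bluestore_compression_required_ratio` can be sketched as follows (logic paraphrased from the option's description above, not taken from the BlueStore source): a blob is only stored compressed if compression shrank it to at most the required ratio of its original size; otherwise the uncompressed data is stored.

```python
# Decide whether a compressed blob is worth keeping, per the required-ratio rule.

def keep_compressed(original_size: int, compressed_size: int,
                    required_ratio: float = 0.875) -> bool:
    """Store compressed only if compressed/original <= required_ratio."""
    return compressed_size <= original_size * required_ratio

assert keep_compressed(4096, 3000)      # 3000/4096 ~ 0.73 -> store compressed
assert not keep_compressed(4096, 4000)  # 4000/4096 ~ 0.98 -> store uncompressed
```

This is why barely-compressible data gains nothing from enabling compression: it still pays the CPU cost of attempting compression, then gets stored uncompressed anyway.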

ceph cluster management commands

  1. Storage pool basic management

    Create storage pool, format example

    $ ceph osd pool create <poolname> pg_num pgp_num {replicated|erasure}
    $ ceph osd pool create study 8 8 
    pool 'study' created

    List storage pools

    $ ceph osd lspools
    1 .rgw.root
    2 study

    Rename storage pool, format example

    $ ceph osd pool rename old-name new-name 
    $ ceph osd pool rename study re-study
    pool 'study' renamed to 're-study'

    Display storage pool usage information

    $ rados df 
    $ ceph osd df 

    Delete storage pool

    ## 1. To prevent a storage pool from being deleted by mistake, ceph has two protection mechanisms. First, the pool's nodelete flag must be false
    $ ceph osd pool set re-study nodelete false
    set pool 13 nodelete to false
    $ ceph osd pool get re-study nodelete
    nodelete: false
    ## 2. Second, the monitors must be set to allow deletion: mon_allow_pool_delete=true
    $ ceph tell mon.* injectargs --mon-allow-pool-delete=true 
    injectargs:mon_allow_pool_delete = 'true' 
    ## 3. When deleting, type the pool name twice and add the flag --yes-i-really-really-mean-it
    $ ceph osd pool rm re-study re-study --yes-i-really-really-mean-it
    pool 're-study' removed
  2. Storage pool quota

    A storage pool supports two quotas on stored objects: maximum space (max_bytes) and maximum number of objects (max_objects). Neither is limited by default, for example:

    ## Check the quota of replicapool storage pool. N/A means unlimited
    $ ceph osd pool get-quota replicapool 
    quotas for pool 'replicapool':
      max objects: N/A
      max bytes  : N/A
    ## Set the maximum number of objects to 1000 and the maximum bytes to 1000000000
    $ ceph osd pool set-quota replicapool max_objects 1000
    set-quota max_objects = 1000 for pool replicapool
    $ ceph osd pool set-quota replicapool max_bytes 1000000000
    set-quota max_bytes = 1000000000 for pool replicapool
    $ ceph osd pool get-quota replicapool 
    quotas for pool 'replicapool':
      max objects: 1k objects
      max bytes  : 954 MiB
    ## Setting a quota to 0 removes the limit
    $ ceph osd pool set-quota replicapool max_bytes 0
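As a side note, the "954 MiB" shown by `get-quota` above is simply the decimal `max_bytes` value rendered in binary units (1 MiB = 1048576 bytes), assuming rounding to the nearest MiB:

```python
# Convert a decimal byte count to MiB, as the quota display does.

def to_mib(nbytes: int) -> int:
    return round(nbytes / (1024 * 1024))

assert to_mib(1000000000) == 954  # matches the "max bytes: 954 MiB" output above
```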
  3. Common parameters of storage pool

    View the number and minimum number of copies of storage pool objects

    $ ceph osd pool get replicapool size
    size: 1
    $ ceph osd pool get replicapool min_size 
    min_size: 1

    min_size: the minimum number of replicas required for the pool to serve I/O. The default is 2: with a three-replica pool, if one OSD fails, the remaining two replicas keep serving normally; but if another fails and only one replica is left, the pool can no longer serve requests.

    View the number of storage pools pg and pgp

    $ ceph osd pool get replicapool pg_num 
    pg_num: 32
    $ ceph osd pool get replicapool pgp_num 
    pgp_num: 32

    Flags controlling whether the pool's pg_num/pgp_num and size can be changed

    $ ceph osd pool get replicapool nopgchange
    nopgchange: false
    $ ceph osd pool get replicapool nosizechange
    nosizechange: false

    Lightweight scanning and deep scanning management

    ## Turn off light scan and deep scan
    $ ceph osd pool set replicapool noscrub true
    $ ceph osd pool set replicapool nodeep-scrub true
    ## The minimum and maximum scrub intervals are not set by default; specify them in the configuration file if needed
    osd_scrub_min_interval xxx
    osd_scrub_max_interval xxx
    osd_deep_scrub_interval xxx

    Viewing the default configuration of ceph osd

    $ ceph daemon osd.1 config show | grep scrub

Tags: Ceph

Posted by Cloud9247 on Thu, 05 May 2022 14:52:57 +0300