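Contents
- Preface
- 1. Environment
- 2. Deployment steps
- 2.1 Base environment preparation
- 2.2 Install Docker on every node
- 2.3 Set up SSH trust between nodes
- 2.4 Download ceph-ansible
- 3. Configure the deployment files
- 3.1 Use the locally installed Docker
- 3.2 Configure the hosts inventory file
- 3.3 Configure group_vars/all.yml
- 3.4 Start the deployment
- 3.5 Install the ceph-common package
- 3.6 Deployment result
- 4. Experiments
- 4.1 Removing an OSD
- 4.2 Adding an OSD
- 4.3 Re-adding the OSD removed in 4.1 after replacing its disk
- 4.4 Adding an OSD-only node
- 4.5 Removing the new node04
- Summary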
Preface
These notes record how I deployed Ceph 14 (Nautilus) with ceph-ansible.
ceph-ansible documentation: https://docs.ceph.com/projects/ceph-ansible/en/latest/osds/scenarios.html
1. Environment
Operating system: CentOS 7.9
Hosts and data disks
Hostname | IP address | Disk 1 | Disk 2 | Disk 3 | Disk 4 | Disk 5 |
---|---|---|---|---|---|---|
node01 | 192.168.150.72 | /dev/vdb | /dev/vdc | /dev/vdd | | |
node02 | 192.168.150.73 | /dev/vdb | /dev/vdc | /dev/vdd | /dev/vde | |
node03 | 192.168.150.74 | /dev/vdb | /dev/vdc | /dev/vdd | /dev/vde | /dev/vdf |
2. Deployment steps
2.1 Base environment preparation
For the base environment setup, see:
https://blog.csdn.net/baidu_35848778/article/details/145564790
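2.2 Install Docker on every node
My Docker setup pulls from a local Harbor registry that already holds all the required images. Pulling from public mirrors in mainland China is currently unreliable, so it is best to download the images in advance and push them to a local registry.
# Add the Docker CE repository (yum-config-manager is provided by yum-utils)
yum install -y yum-utils
yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
# Configure the local registry and switch the cgroup driver to systemd
mkdir -p /etc/docker/
cat > /etc/docker/daemon.json <<EOF
{
  "insecure-registries": ["http://harbor.XXX.XX.XX:10002"],
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {"max-size": "100m"}
}
EOF
# Install the packages
yum install docker-ce docker-ce-cli -y
# Start the service and enable it at boot
systemctl restart docker
systemctl enable docker
# Log in to the registry
docker login http://harbor.XXX.XX.XX:10002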
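If the same daemon.json has to be written on every node, wrapping the heredoc in a function keeps the registry address in one place. This is a minimal sketch. `write_daemon_json` is a hypothetical name. The registry is passed in the bare `host:port` form, which is the form Docker's documentation uses for `insecure-registries`.

```shell
#!/bin/sh
# Hypothetical helper: write the daemon.json used in this post.
# Usage: write_daemon_json REGISTRY [TARGET_FILE]
write_daemon_json() {
  local registry="$1" target="${2:-/etc/docker/daemon.json}"
  if [ -z "$registry" ]; then
    echo "usage: write_daemon_json REGISTRY [TARGET_FILE]" >&2
    return 1
  fi
  mkdir -p "$(dirname "$target")"
  # Same settings as above: insecure local registry, systemd cgroup driver,
  # json-file logging capped at 100 MB per log file
  cat > "$target" <<EOF
{
  "insecure-registries": ["${registry}"],
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {"max-size": "100m"}
}
EOF
}
```

On each node, run `write_daemon_json harbor.XXX.XX.XX:10002` followed by `systemctl restart docker`. Afterwards, `docker info --format '{{.CgroupDriver}}'` should print `systemd`.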
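2.3 Set up SSH trust between nodes
There are several ways to set up SSH trust. Here I collect every node's public key into one authorized_keys file and distribute that file.
Add the nodes to /etc/hosts on every node:
cat <<EOF >> /etc/hosts
192.168.150.72 node01
192.168.150.73 node02
192.168.150.74 node03
EOF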
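Appending with `cat >> /etc/hosts` adds duplicate lines if the step is repeated, for example when node04 joins in section 4.4. The sketch below is an idempotent variant. It assumes a hypothetical helper named `add_host_entry`, which appends an `IP NAME` pair only if that pair is not already present.

```shell
#!/bin/sh
# Hypothetical helper: append "IP NAME" to a hosts file only if that pair is
# not there yet, so repeating the step does not create duplicate lines.
# Usage: add_host_entry IP NAME [HOSTS_FILE]
add_host_entry() {
  local ip="$1" name="$2" file="${3:-/etc/hosts}"
  # awk exits 0 only if some line has IP in field 1 and NAME in a later field;
  # an empty or missing file counts as "not found"
  if ! awk -v ip="$ip" -v name="$name" '
      $1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
      END { exit !found }' "$file" 2>/dev/null; then
    printf '%s %s\n' "$ip" "$name" >> "$file"
  fi
}
```

For example, run `add_host_entry 192.168.150.72 node01` once per node. The optional third argument lets you try it on a scratch file first.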
Generate a key pair on every node:
ssh-keygen -f ~/.ssh/id_rsa -P '' -q
From the primary node (192.168.150.72), copy its key to every node:
yum install -y sshpass
sshpass -p "password" ssh-copy-id -i /root/.ssh/id_rsa.pub -o StrictHostKeyChecking=no root@192.168.150.72
sshpass -p "password" ssh-copy-id -i /root/.ssh/id_rsa.pub -o StrictHostKeyChecking=no root@192.168.150.73
sshpass -p "password" ssh-copy-id -i /root/.ssh/id_rsa.pub -o StrictHostKeyChecking=no root@192.168.150.74
On the primary node, collect the other nodes' public keys:
ssh root@192.168.150.73 'cat ~/.ssh/id_rsa.pub' >> /root/.ssh/authorized_keys
ssh root@192.168.150.74 'cat ~/.ssh/id_rsa.pub' >> /root/.ssh/authorized_keys
From the primary node, push the combined authorized_keys file to the other nodes:
scp /root/.ssh/authorized_keys 192.168.150.73:/root/.ssh/
scp /root/.ssh/authorized_keys 192.168.150.74:/root/.ssh/
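The per-node commands above can also be written as loops. The sketch below assumes a space-separated `NODES` list and an `SSH_PASS` variable, both hypothetical names. It only prints the commands unless `DRY_RUN=0` is set. `verify_trust` uses `ssh -o BatchMode=yes`, which fails instead of prompting for a password when key-based login does not work.

```shell
#!/bin/sh
# Hypothetical loop version of the key distribution steps above.
# NODES and SSH_PASS are assumptions; set them for your environment.
# Commands are only printed unless DRY_RUN=0.
NODES="192.168.150.72 192.168.150.73 192.168.150.74"

run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy the local public key to every node using password login via sshpass
distribute_key() {
  local node
  if [ -z "${SSH_PASS:-}" ]; then
    echo "set SSH_PASS first" >&2
    return 1
  fi
  for node in $NODES; do
    run sshpass -p "$SSH_PASS" ssh-copy-id -i /root/.ssh/id_rsa.pub \
      -o StrictHostKeyChecking=no "root@${node}"
  done
}

# Check key-based login: BatchMode=yes makes ssh fail instead of asking for a
# password, so any node without working trust is reported
verify_trust() {
  local node failed=0
  for node in $NODES; do
    if ! run ssh -o BatchMode=yes -o ConnectTimeout=5 "root@${node}" hostname; then
      echo "no passwordless login to ${node}" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

After the authorized_keys file has been pushed, `DRY_RUN=0 verify_trust` run from each node should print every hostname without any password prompt.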
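2.4 Download ceph-ansible
Download the source. If GitHub is hard to reach from mainland China, one workaround is the one I used: download it on a preemptible Alibaba Cloud instance in Hong Kong.
yum install python2-pip ansible git python-netaddr -y
mkdir -p /data/installceph/ && cd /data/installceph/
git config --global http.postBuffer 5242880
git clone https://github.com/ceph/ceph-ansible.git
cd ceph-ansible
# Switch branches; the target release is Ceph 14 (Nautilus)
git checkout stable-4.0
Branch and version mapping: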
stable-3.0 Supports Ceph versions jewel and luminous. This branch requires Ansible version 2.4.
stable-3.1 Supports Ceph versions luminous and mimic. This branch requires Ansible version 2.4.
stable-3.2 Supports Ceph versions luminous and mimic. This branch requires Ansible version 2.6.
stable-4.0 Supports Ceph version nautilus. This branch requires Ansible version 2.9.
stable-5.0 Supports Ceph version octopus. This branch requires Ansible version 2.9.
stable-6.0 Supports Ceph version pacific. This branch requires Ansible version 2.10.
stable-7.0 Supports Ceph version quincy. This branch requires Ansible version 2.12.
main Supports the main (devel) branch of Ceph. This branch requires Ansible version 2.12.
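The branch table can be turned into a small lookup, so the checkout step does not depend on memory. The hypothetical `ceph_ansible_branch` function returns the newest branch listed for each release, together with the Ansible version that branch requires. For luminous and mimic it picks stable-3.2.

```shell
#!/bin/sh
# Hypothetical lookup derived from the branch table above.
# Prints "<branch> <required Ansible version>" for a Ceph release name,
# choosing the newest branch that lists the release.
ceph_ansible_branch() {
  case "$1" in
    jewel)          echo "stable-3.0 2.4" ;;
    luminous|mimic) echo "stable-3.2 2.6" ;;
    nautilus)       echo "stable-4.0 2.9" ;;
    octopus)        echo "stable-5.0 2.9" ;;
    pacific)        echo "stable-6.0 2.10" ;;
    quincy)         echo "stable-7.0 2.12" ;;
    *)              echo "unknown Ceph release: $1" >&2; return 1 ;;
  esac
}
```

Usage: `git checkout "$(ceph_ansible_branch nautilus | cut -d' ' -f1)"` checks out stable-4.0.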
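3. Configure the deployment files
3.1 Use the locally installed Docker
Docker is already installed on every node (section 2.2), so comment out the task that installs the container packages in:
/data/installceph/ceph-ansible/roles/ceph-container-engine/tasks/pre_requisites/prerequisites.yml
#- name: install container packages
#  package:
#    name: ['{{ container_package_name }}', '{{ container_binding_name }}']
#    update_cache: true
#  register: result
#  until: result is succeeded
#  tags: with_pkg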
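Instead of editing prerequisites.yml by hand, you can comment out the task with sed. This sketch assumes the task starts at a `- name: install container packages` line and ends at its `tags: with_pkg` line, as shown above. The function name is hypothetical, and sed keeps a `.bak` copy of the original file.

```shell
#!/bin/sh
# Hypothetical helper: prefix every line of the "install container packages"
# task with "#". Assumes the task begins with "- name: install container
# packages" and ends with its "tags: with_pkg" line, as shown above.
# sed -i.bak edits the file in place and keeps a .bak copy.
comment_out_pkg_task() {
  # The range only starts on an uncommented "- name:" line, so running the
  # function twice does not produce "##"
  sed -i.bak \
    '/^[[:space:]]*- name: install container packages/,/tags: with_pkg/ s/^/#/' \
    "$1"
}
```

Usage: `comment_out_pkg_task /data/installceph/ceph-ansible/roles/ceph-container-engine/tasks/pre_requisites/prerequisites.yml`. Then run `git diff` in the repository to confirm that only those seven lines changed.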
3.2 Configure the hosts inventory file
Each node has a different set of disks, so the OSD devices are listed per host in the inventory:
cat <<EOF >> /data/installceph/ceph-ansible/hosts
[mons]
node01
node02
node03

[osds]
node01 devices="['/dev/vdb','/dev/vdc','/dev/vdd']"
node02 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde']"
node03 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde','/dev/vdf']"

[mgrs]
node01
node02
node03

[mdss]
node01
node02
node03

[clients]
node01

[rgws]
node01

[grafana-server]
node01
EOF
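When the disk layout differs per node, you can generate the `[osds]` lines instead of typing them. This is a minimal sketch with a hypothetical `gen_osds_section` function. Each argument has the form `host:disk,disk,...` and uses short device names.

```shell
#!/bin/sh
# Hypothetical generator for the [osds] inventory section.
# Each argument is "host:disk1,disk2,..." with short names (vdb -> /dev/vdb).
gen_osds_section() {
  local entry host disks list
  echo "[osds]"
  for entry in "$@"; do
    host="${entry%%:*}"
    disks="${entry#*:}"
    # "vdb,vdc" -> "'/dev/vdb','/dev/vdc'"
    list="$(printf '%s\n' "$disks" | sed "s#\([^,][^,]*\)#'/dev/\1'#g")"
    # Same devices="[...]" format as the inventory above
    printf '%s devices="[%s]"\n' "$host" "$list"
  done
}
```

`gen_osds_section node01:vdb,vdc,vdd node02:vdb,vdc,vdd,vde node03:vdb,vdc,vdd,vde,vdf` prints the same `[osds]` block as the inventory above.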
3.3 Configure group_vars/all.yml
\cp /data/installceph/ceph-ansible/group_vars/all.yml.sample /data/installceph/ceph-ansible/group_vars/all.yml
cat <<EOF >> /data/installceph/ceph-ansible/group_vars/all.yml
######################################################
#               INSTALL OPTIONS BY USER              #
#                                                    #
######################################################

# Install options
# -----------------------------
ceph_origin: repository
ceph_repository: community
ceph_mirror: http://mirrors.aliyun.com/ceph
ceph_stable_key: http://mirrors.aliyun.com/ceph/keys/release.asc
ceph_stable_release: nautilus
ceph_stable_repo: "{{ ceph_mirror }}/rpm-{{ ceph_stable_release }}"
# -----------------------------
ceph_docker_registry: harbor.XXX.XX.XX:10002
#node_exporter_container_image: "prom/node-exporter:v0.17.0"
#grafana_container_image: "grafana/grafana:5.4.3"
#prometheus_container_image: "prom/prometheus:v2.7.2"
#alertmanager_container_image: "prom/alertmanager:v0.16.2"

# Ceph options
# -----------------------------
generate_fsid: true
ceph_conf_key_directory: /etc/ceph
cephx: true
# -----------------------------

# Client options
# -----------------------------
rbd_cache: "false"
rbd_client_log_path: /var/log/ceph
# ----------------------------

# Monitor options
# -----------------------------
monitor_interface: eth0
# ----------------------------

# OSD options
# -----------------------------
journal_size: 5120  # only used by filestore OSDs; ignored with bluestore
public_network: 192.168.150.0/24
cluster_network: 192.168.150.0/24
osd_objectstore: bluestore
# -----------------------------

# RGW options
# -----------------------------
radosgw_interface: eth0
# -----------------------------

# Testing mode
# -----------------------------
#common_single_host_mode: true
# -----------------------------

# DOCKER options
# -----------------------------
ceph_docker_image: "ceph/daemon"
ceph_docker_image_tag: latest-nautilus
containerized_deployment: true
# -----------------------------

# DASHBOARD options
# -----------------------------
dashboard_enabled: False
dashboard_protocol: http
dashboard_port: 8443
dashboard_admin_user: admin
dashboard_admin_password: admin123456
grafana_admin_user: admin
grafana_admin_password: admin
# -----------------------------
EOF
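Because all.yml.sample is copied first and the settings are appended afterwards, it is easy to miss a key that is still commented out. A hypothetical `check_all_yml` function can confirm that the keys used in this post appear uncommented at the start of a line. The key list is my assumption and can be adjusted.

```shell
#!/bin/sh
# Hypothetical sanity check for group_vars/all.yml: report any key from the
# list below that is missing or only present as a commented-out line.
# The key list is an assumption based on the settings used in this post.
check_all_yml() {
  local file="${1:-group_vars/all.yml}" key missing=0
  for key in ceph_origin ceph_stable_release ceph_docker_registry \
             ceph_docker_image ceph_docker_image_tag containerized_deployment \
             monitor_interface public_network cluster_network osd_objectstore; do
    if ! grep -qE "^${key}:" "$file"; then
      echo "missing or commented out: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run `check_all_yml /data/installceph/ceph-ansible/group_vars/all.yml` before starting the playbook. A non-zero exit status lists the keys that still need attention.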
3.4 Start the deployment
cd /data/installceph/ceph-ansible
cp site-docker.yml.sample site-docker.yml
ansible-playbook -i /data/installceph/ceph-ansible/hosts /data/installceph/ceph-ansible/site-docker.yml
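stable-4.0 requires Ansible 2.9, so a pre-flight check before running the playbook can save a failed run. This is a sketch with hypothetical function names. `ansible_minor_version` reads the first line of `ansible --version` (for example `ansible 2.9.27`) and prints `MAJOR.MINOR`.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the Ansible version.
# ansible_minor_version: read `ansible --version` output on stdin and print
# MAJOR.MINOR from its first line ("ansible 2.9.27" -> "2.9",
# "ansible [core 2.12.5]" -> "2.12").
ansible_minor_version() {
  head -n 1 | grep -oE '[0-9]+\.[0-9]+' | head -n 1
}

# require_ansible WANT: fail unless the installed Ansible is WANT (e.g. 2.9)
require_ansible() {
  local want="${1:-2.9}" have
  have="$(ansible --version 2>/dev/null | ansible_minor_version)"
  if [ "$have" != "$want" ]; then
    echo "need Ansible ${want}, found ${have:-none}" >&2
    return 1
  fi
  echo "Ansible ${have} OK"
}
```

Usage: `require_ansible 2.9 && ansible-playbook -i hosts site-docker.yml`.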
3.5 Install the ceph-common package
I prefer to run ceph commands directly on the host, so I install ceph-common:
yum install epel-release -y
cat <<END >/etc/yum.repos.d/ceph.repo
[Ceph]
name=Ceph packages for \$basearch
baseurl=http://mirrors.aliyun.com/ceph/rpm-nautilus/el7/\$basearch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc

[Ceph-noarch]
name=Ceph noarch packages
baseurl=http://mirrors.aliyun.com/ceph/rpm-nautilus/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc

[ceph-source]
name=Ceph source packages
baseurl=http://mirrors.aliyun.com/ceph/rpm-nautilus/el7/SRPMS
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
END
yum clean all
yum makecache
yum install -y ceph-common.x86_64
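You can also generate the repo file with a function that takes the release and mirror as parameters. This is a minimal sketch with a hypothetical `gen_ceph_repo`. With no arguments it reproduces the Nautilus/Aliyun file used in this step.

```shell
#!/bin/sh
# Hypothetical generator for ceph.repo, parameterised by Ceph release and
# mirror; the defaults reproduce the Nautilus/Aliyun file used in this post.
gen_ceph_repo() {
  local release="${1:-nautilus}"
  local base="${2:-http://mirrors.aliyun.com/ceph}/rpm-${release}/el7"
  local key="https://download.ceph.com/keys/release.asc"
  # \$basearch is escaped so that yum, not the shell, expands it
  cat <<EOF
[Ceph]
name=Ceph packages for \$basearch
baseurl=${base}/\$basearch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=${key}

[Ceph-noarch]
name=Ceph noarch packages
baseurl=${base}/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=${key}

[ceph-source]
name=Ceph source packages
baseurl=${base}/SRPMS
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=${key}
EOF
}
```

Usage: `gen_ceph_repo nautilus > /etc/yum.repos.d/ceph.repo`. After ceph-common is installed, `ceph --version` should report a 14.2.x nautilus build.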
3.6 Deployment result
The OSD layout matches expectations:
[root@node01 ceph-ansible]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       1.17224 root default
 -3       0.29306     host node01
  0   hdd 0.09769         osd.0       up  1.00000 1.00000
  3   hdd 0.09769         osd.3       up  1.00000 1.00000
  6   hdd 0.09769         osd.6       up  1.00000 1.00000
 -7       0.39075     host node02
  1   hdd 0.09769         osd.1       up  1.00000 1.00000
  4   hdd 0.09769         osd.4       up  1.00000 1.00000
  7   hdd 0.09769         osd.7       up  1.00000 1.00000
  9   hdd 0.09769         osd.9       up  1.00000 1.00000
 -5       0.48843     host node03
  2   hdd 0.09769         osd.2       up  1.00000 1.00000
  5   hdd 0.09769         osd.5       up  1.00000 1.00000
  8   hdd 0.09769         osd.8       up  1.00000 1.00000
 10   hdd 0.09769         osd.10      up  1.00000 1.00000
 11   hdd 0.09769         osd.11      up  1.00000 1.00000
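To compare the result with the inventory (3, 4 and 5 disks on node01, node02 and node03), you can count the OSDs per host from the `ceph osd tree` text. This is a sketch with a hypothetical `osds_per_host` function. It relies on the plain-text column layout shown above, so `ceph osd tree -f json` would be a sturdier input for real scripting.

```shell
#!/bin/sh
# Hypothetical helper: count OSDs per host in plain `ceph osd tree` output
# read on stdin. Bucket rows start with a negative ID ("-3 0.29306 host
# node01"); OSD rows look like "0 hdd 0.09769 osd.0 up 1.00000 1.00000".
osds_per_host() {
  awk '
    $1 ~ /^-/ && $3 == "host" { host = $4; order[++n] = host; count[host] = 0; next }
    $1 ~ /^-/ { host = ""; next }   # root, rack, ... buckets end the current host
    host != "" && ($3 ~ /^osd\./ || $4 ~ /^osd\./) { count[host]++ }
    END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
  '
}
```

For the tree above, `ceph osd tree | osds_per_host` should print `node01 3`, `node02 4` and `node03 5`.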
4. Experiments
4.1 Removing an OSD
Goal: simulate removing osd.11 after it has failed and can no longer serve data.
# General form: ansible-playbook -vv -i hosts infrastructure-playbooks/shrink-osd.yml -e osd_to_kill=1,2,3
# Command used here:
ansible-playbook -vv -i hosts infrastructure-playbooks/shrink-osd.yml -e osd_to_kill=11
Result:
Thursday 13 February 2025 15:33:26 +0800 (0:00:00.373)       0:00:31.086 *****
ok: [node01] => changed=false
  cmd:
  - docker
  - exec
  - ceph-mon-node01
  - ceph
  - --cluster
  - ceph
  - -s
  delta: '0:00:00.547188'
  end: '2025-02-13 15:33:27.087717'
  rc: 0
  start: '2025-02-13 15:33:26.540529'
  stderr: ''
  stderr_lines: <omitted>
  stdout: |2-
      cluster:
        id:     84a44515-64c1-4f5c-b9c5-a0cc3e797074
        health: HEALTH_WARN
                Degraded data redundancy: 28/627 objects degraded (4.466%), 7 pgs degraded

      services:
        mon: 3 daemons, quorum node01,node02,node03 (age 76m)
        mgr: node02(active, since 74m), standbys: node01, node03
        mds: cephfs:1 {0=node03=up:active} 2 up:standby
        osd: 11 osds: 11 up (since 14s), 11 in (since 16s); 1 remapped pgs
        rgw: 1 daemon active (node01.rgw0)

      task status:

      data:
        pools:   6 pools, 144 pgs
        objects: 209 objects, 3.4 KiB
        usage:   11 GiB used, 1.1 TiB / 1.1 TiB avail
        pgs:     28/627 objects degraded (4.466%)
                 135 active+clean
                 3   active+recovery_wait+degraded
                 3   active+recovering+degraded
                 2   active+recovery_wait
                 1   active+recovery_wait+undersized+degraded+remapped

      io:
        recovery: 3 B/s, 1 keys/s, 2 objects/s

      progress:
        Rebalancing after osd.11 marked out
          [==================............]
  stdout_lines: <omitted>

TASK [show ceph osd tree] *****************************************************
task path: /data/installceph/ceph-ansible/infrastructure-playbooks/shrink-osd.yml:254
Thursday 13 February 2025 15:33:27 +0800 (0:00:00.999)       0:00:32.085 *****
ok: [node01] => changed=false
  cmd:
  - docker
  - exec
  - ceph-mon-node01
  - ceph
  - --cluster
  - ceph
  - osd
  - tree
  delta: '0:00:00.560455'
  end: '2025-02-13 15:33:28.017771'
  rc: 0
  start: '2025-02-13 15:33:27.457316'
  stderr: ''
  stderr_lines: <omitted>
  stdout: |-
    ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
     -1       1.07455 root default
     -3       0.29306     host node01
      0   hdd 0.09769         osd.0       up  1.00000 1.00000
      3   hdd 0.09769         osd.3       up  1.00000 1.00000
      6   hdd 0.09769         osd.6       up  1.00000 1.00000
     -7       0.39075     host node02
      1   hdd 0.09769         osd.1       up  1.00000 1.00000
      4   hdd 0.09769         osd.4       up  1.00000 1.00000
      7   hdd 0.09769         osd.7       up  1.00000 1.00000
      9   hdd 0.09769         osd.9       up  1.00000 1.00000
     -5       0.39075     host node03
      2   hdd 0.09769         osd.2       up  1.00000 1.00000
      5   hdd 0.09769         osd.5       up  1.00000 1.00000
      8   hdd 0.09769         osd.8       up  1.00000 1.00000
     10   hdd 0.09769         osd.10      up  1.00000 1.00000
  stdout_lines: <omitted>
META: ran handlers

PLAY RECAP ********************************************************************
node01                     : ok=19   changed=3    unreachable=0    failed=0    skipped=12   rescued=0    ignored=0
node02                     : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
node03                     : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
After the removal, delete the corresponding disk from the inventory:
[osds]
node01 devices="['/dev/vdb','/dev/vdc','/dev/vdd']"
node02 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde']"
# Entry before removing osd.11:
# node03 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde','/dev/vdf']"
# Entry after removing osd.11:
node03 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde']"
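shrink-osd.yml removes OSDs permanently, so a small guard against typos in `osd_to_kill` can help. The hypothetical `shrink_osds` wrapper accepts only numeric IDs and joins them with commas. By default it only prints the command. Set `DRY_RUN=0` to run it from the ceph-ansible directory.

```shell
#!/bin/sh
# Hypothetical wrapper around shrink-osd.yml: validate the OSD IDs, join them
# with commas and print the command (DRY_RUN=1, the default) or run it.
shrink_osds() {
  local id ids
  if [ "$#" -eq 0 ]; then
    echo "usage: shrink_osds ID [ID...]" >&2
    return 2
  fi
  for id in "$@"; do
    case "$id" in
      ''|*[!0-9]*) echo "not an OSD id: ${id}" >&2; return 2 ;;
    esac
  done
  # "$*" joins the arguments with the first character of IFS
  ids="$(IFS=','; echo "$*")"
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "ansible-playbook -vv -i hosts infrastructure-playbooks/shrink-osd.yml -e osd_to_kill=${ids}"
  else
    ansible-playbook -vv -i hosts infrastructure-playbooks/shrink-osd.yml -e "osd_to_kill=${ids}"
  fi
}
```

`shrink_osds 11` prints the command used in this experiment. `shrink_osds osd.11` is rejected.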
4.2 Adding an OSD
Goal: add an OSD on a new disk, /dev/vde, on node01.
Add the new disk to the inventory:
[osds]
node01 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde']"
node02 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde']"
# Entry before removing osd.11:
# node03 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde','/dev/vdf']"
# Entry after removing osd.11:
node03 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde']"
Run:
# General form: ansible-playbook -i /data/installceph/ceph-ansible/hosts site-docker.yml --limit osd-node-name
# Command used here:
ansible-playbook -i /data/installceph/ceph-ansible/hosts site-docker.yml --limit node01
Result (the new OSD reuses the freed ID 11):
[root@node01 ceph-ansible]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       1.17224 root default
 -3       0.39075     host node01
  0   hdd 0.09769         osd.0       up  1.00000 1.00000
  3   hdd 0.09769         osd.3       up  1.00000 1.00000
  6   hdd 0.09769         osd.6       up  1.00000 1.00000
 11   hdd 0.09769         osd.11      up  1.00000 1.00000
 -7       0.39075     host node02
  1   hdd 0.09769         osd.1       up  1.00000 1.00000
  4   hdd 0.09769         osd.4       up  1.00000 1.00000
  7   hdd 0.09769         osd.7       up  1.00000 1.00000
  9   hdd 0.09769         osd.9       up  1.00000 1.00000
 -5       0.39075     host node03
  2   hdd 0.09769         osd.2       up  1.00000 1.00000
  5   hdd 0.09769         osd.5       up  1.00000 1.00000
  8   hdd 0.09769         osd.8       up  1.00000 1.00000
 10   hdd 0.09769         osd.10      up  1.00000 1.00000
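Before adding a disk such as /dev/vde to the inventory, it is worth checking that it is blank. This is a sketch with a hypothetical `is_bare_disk` function. `lsblk -n -o NAME DEVICE` prints one line for a disk without partitions or LVM volumes, and more lines otherwise. The `LSBLK` variable is an assumption that exists only so the check can be exercised with a stub.

```shell
#!/bin/sh
# Hypothetical check: succeed only if DEVICE has no partitions or LVM volumes.
# `lsblk -n -o NAME DEVICE` prints the device plus one line per child, so a
# bare disk gives exactly one line. LSBLK may name a stub command for testing.
is_bare_disk() {
  local lines
  lines="$(${LSBLK:-lsblk} -n -o NAME "$1" 2>/dev/null | wc -l)"
  [ "$lines" -eq 1 ]
}
```

Usage: `is_bare_disk /dev/vde && echo "/dev/vde has no partitions"`. A missing device prints no lines, so the check fails for it too.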
4.3 Re-adding the OSD removed in 4.1 after replacing its disk
Add the disk back to node03 in the inventory:
[osds]
node01 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde']"
node02 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde']"
node03 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde','/dev/vdf']"
Run:
# General form: ansible-playbook -i /data/installceph/ceph-ansible/hosts site-docker.yml --limit osd-node-name
# Command used here:
ansible-playbook -i /data/installceph/ceph-ansible/hosts site-docker.yml --limit node03
Result:
[root@node01 ceph-ansible]# ceph -s
  cluster:
    id:     84a44515-64c1-4f5c-b9c5-a0cc3e797074
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum node01,node02,node03 (age 27m)
    mgr: node02(active, since 2h), standbys: node01, node03
    mds: cephfs:1 {0=node02=up:active} 2 up:standby
    osd: 13 osds: 13 up (since 69s), 13 in (since 69s)
    rgw: 1 daemon active (node01.rgw0)

  task status:

  data:
    pools:   6 pools, 144 pgs
    objects: 209 objects, 3.4 KiB
    usage:   13 GiB used, 1.3 TiB / 1.3 TiB avail
    pgs:     144 active+clean

[root@node01 ceph-ansible]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       1.26993 root default
 -3       0.39075     host node01
  0   hdd 0.09769         osd.0       up  1.00000 1.00000
  3   hdd 0.09769         osd.3       up  1.00000 1.00000
  6   hdd 0.09769         osd.6       up  1.00000 1.00000
 11   hdd 0.09769         osd.11      up  1.00000 1.00000
 -7       0.39075     host node02
  1   hdd 0.09769         osd.1       up  1.00000 1.00000
  4   hdd 0.09769         osd.4       up  1.00000 1.00000
  7   hdd 0.09769         osd.7       up  1.00000 1.00000
  9   hdd 0.09769         osd.9       up  1.00000 1.00000
 -5       0.48843     host node03
  2   hdd 0.09769         osd.2       up  1.00000 1.00000
  5   hdd 0.09769         osd.5       up  1.00000 1.00000
  8   hdd 0.09769         osd.8       up  1.00000 1.00000
 10   hdd 0.09769         osd.10      up  1.00000 1.00000
 12   hdd 0.09769         osd.12      up  1.00000 1.00000
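After OSDs are added or removed, the cluster rebalances; the 4.1 output shows recovery still in progress. It can therefore help to wait for `HEALTH_OK` before the next change. This is a sketch with a hypothetical `wait_for_health_ok [TIMEOUT] [INTERVAL]` that polls `ceph health` on a node with ceph-common installed. `CEPH_CMD` is an assumption for substituting another command, such as a stub.

```shell
#!/bin/sh
# Hypothetical helper: poll `ceph health` until it reports HEALTH_OK.
# Usage: wait_for_health_ok [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_health_ok() {
  local timeout="${1:-600}" interval="${2:-10}" waited=0 status
  local ceph_cmd="${CEPH_CMD:-ceph}"
  while :; do
    # The first word of `ceph health` is HEALTH_OK, HEALTH_WARN or HEALTH_ERR
    status="$($ceph_cmd health 2>/dev/null | awk '{print $1}')"
    if [ "$status" = "HEALTH_OK" ]; then
      echo "HEALTH_OK after ${waited}s"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s, last status: ${status:-unknown}" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

For example, `wait_for_health_ok 1800 15` after each `--limit` run waits up to 30 minutes, checking every 15 seconds.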
4.4 Adding an OSD-only node
Prerequisite: install the base environment on the new node and extend the SSH trust to it. I do not show the SSH trust steps here.
Add the new node and its disks to [osds] in the inventory:
[osds]
node01 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde']"
node02 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde']"
node03 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde','/dev/vdf']"
node04 devices="['/dev/vdb','/dev/vdc','/dev/vdd','/dev/vde']"
Run:
# General form: ansible-playbook -i /data/installceph/ceph-ansible/hosts site-docker.yml --limit osd-node-name
# Command used here:
ansible-playbook -i /data/installceph/ceph-ansible/hosts site-docker.yml --limit node04
Result:
[root@node01 ceph-ansible]# ceph -s
  cluster:
    id:     84a44515-64c1-4f5c-b9c5-a0cc3e797074
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum node01,node02,node03 (age 63s)
    mgr: node02(active, since 2h), standbys: node01, node03
    mds: cephfs:1 {0=node02=up:active} 2 up:standby
    osd: 17 osds: 17 up (since 111s), 17 in (since 111s)
    rgw: 1 daemon active (node01.rgw0)

  task status:

  data:
    pools:   6 pools, 144 pgs
    objects: 209 objects, 3.4 KiB
    usage:   17 GiB used, 1.6 TiB / 1.7 TiB avail
    pgs:     144 active+clean

[root@node01 ceph-ansible]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       1.66068 root default
 -3       0.39075     host node01
  0   hdd 0.09769         osd.0       up  1.00000 1.00000
  3   hdd 0.09769         osd.3       up  1.00000 1.00000
  6   hdd 0.09769         osd.6       up  1.00000 1.00000
 11   hdd 0.09769         osd.11      up  1.00000 1.00000
 -7       0.39075     host node02
  1   hdd 0.09769         osd.1       up  1.00000 1.00000
  4   hdd 0.09769         osd.4       up  1.00000 1.00000
  7   hdd 0.09769         osd.7       up  1.00000 1.00000
  9   hdd 0.09769         osd.9       up  1.00000 1.00000
 -5       0.48843     host node03
  2   hdd 0.09769         osd.2       up  1.00000 1.00000
  5   hdd 0.09769         osd.5       up  1.00000 1.00000
  8   hdd 0.09769         osd.8       up  1.00000 1.00000
 10   hdd 0.09769         osd.10      up  1.00000 1.00000
 12   hdd 0.09769         osd.12      up  1.00000 1.00000
 -9       0.39075     host node04
 13   hdd 0.09769         osd.13      up  1.00000 1.00000
 14   hdd 0.09769         osd.14      up  1.00000 1.00000
 15   hdd 0.09769         osd.15      up  1.00000 1.00000
 16   hdd 0.09769         osd.16      up  1.00000 1.00000
4.5 Removing the new node04
Goal: first remove all OSDs on node04, then remove the node04 host.
Run:
# General form: ansible-playbook -vv -i hosts infrastructure-playbooks/shrink-osd.yml -e osd_to_kill=1,2,3
# Command used here:
ansible-playbook -vv -i hosts infrastructure-playbooks/shrink-osd.yml -e osd_to_kill=13,14,15,16
Result:
All four OSDs were removed, but host node04 is still in the CRUSH tree, with weight 0. I did not find a playbook for removing a host in the playbook list. My guess is that this branch is relatively old and the scenario is uncommon.
[root@node01 ceph-ansible]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       1.26993 root default
 -3       0.39075     host node01
  0   hdd 0.09769         osd.0       up  1.00000 1.00000
  3   hdd 0.09769         osd.3       up  1.00000 1.00000
  6   hdd 0.09769         osd.6       up  1.00000 1.00000
 11   hdd 0.09769         osd.11      up  1.00000 1.00000
 -7       0.39075     host node02
  1   hdd 0.09769         osd.1       up  1.00000 1.00000
  4   hdd 0.09769         osd.4       up  1.00000 1.00000
  7   hdd 0.09769         osd.7       up  1.00000 1.00000
  9   hdd 0.09769         osd.9       up  1.00000 1.00000
 -5       0.48843     host node03
  2   hdd 0.09769         osd.2       up  1.00000 1.00000
  5   hdd 0.09769         osd.5       up  1.00000 1.00000
  8   hdd 0.09769         osd.8       up  1.00000 1.00000
 10   hdd 0.09769         osd.10      up  1.00000 1.00000
 12   hdd 0.09769         osd.12      up  1.00000 1.00000
 -9             0     host node04
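For the leftover bucket, the Ceph CRUSH map documentation describes removing an empty bucket with `ceph osd crush remove {bucket-name}`; the bucket must be empty first. The sketch below is a manual workaround, not a ceph-ansible playbook. The hypothetical `empty_hosts` function lists host buckets that contain no OSDs, reading `ceph osd tree` text. `remove_empty_hosts` prints the removal commands, or runs them when `DRY_RUN=0` is set. Remember to delete node04 from the `[osds]` group in the inventory as well.

```shell
#!/bin/sh
# Hypothetical manual cleanup: list host buckets without OSDs from plain
# `ceph osd tree` output on stdin (e.g. the weight-0 node04 above).
empty_hosts() {
  awk '
    $1 ~ /^-/ && $3 == "host" { host = $4; order[++n] = host; count[host] = 0; next }
    $1 ~ /^-/ { host = ""; next }
    host != "" && ($3 ~ /^osd\./ || $4 ~ /^osd\./) { count[host]++ }
    END { for (i = 1; i <= n; i++) if (count[order[i]] == 0) print order[i] }
  '
}

# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to remove the buckets
remove_empty_hosts() {
  local h
  for h in $(ceph osd tree | empty_hosts); do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "ceph osd crush remove ${h}"
    else
      ceph osd crush remove "$h"
    fi
  done
}
```

With the tree above, `remove_empty_hosts` would print `ceph osd crush remove node04`.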
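Summary
These notes record how I deployed Ceph Nautilus with ceph-antsible, and how OSDs and an OSD-only node were added and removed afterwards, for future reference.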