OpenInfra Day
Ceph Issue Case Study
2019.07.11
Youngju Lee, Open Source Consulting
( yjlee@osci.kr )
Contents
01. Architecture
02. The Issue
03. Resolution
Architecture
01
01. Architecture
● Overall architecture
[Diagram: Deploy node, Controller Node, Compute Node, Storage Node, Firewall, Router]
01. Architecture
● Ceph architecture
[Diagram: OSD nodes ceph-osd1–ceph-osd3 hosting osd.0–osd.8; monitors ceph-mon1, ceph-mon2, ceph-mon3]
01. Architecture
● Ceph object flow
PG (Placement Group): the group of OSDs that stores an object; the number of members matches the replica count.
OSD (Object Storage Daemon): the daemon that ultimately stores the objects.
Monitor: monitors changes in the Ceph OSDs and builds the CRUSH map.
[root@ceph-osd01 ~]# rados ls -p vms
rbd_data.1735e637a64d5.0000000000000000
rbd_header.1735e637a64d5
rbd_directory
rbd_children
rbd_info
rbd_data.1735e637a64d5.0000000000000003
rbd_data.1735e637a64d5.0000000000000002
rbd_id.893f4f3d-f6d9-4521-997c-72caa861ac24_disk
rbd_data.1735e637a64d5.0000000000000001
rbd_object_map.1735e637a64d5
[root@ceph-osd01 ~]#
The default object size is 4 MB.
CRUSH (Controlled Replication Under Scalable Hashing): the algorithm that determines how objects are distributed across OSDs.
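Not on the original slide, but a quick way to see CRUSH in action: ceph osd map shows which PG and which OSDs a given object hashes to (object name taken from the rados listing above):
# Trace object -> PG -> up/acting OSD set (read-only)
$ ceph osd map vms rbd_data.1735e637a64d5.0000000000000000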
The Issue
02
02. The Issue
Ceph architecture
[Diagram: OSD nodes ceph-osd1–ceph-osd3 hosting osd.0–osd.8; monitors ceph-mon1, ceph-mon2, ceph-mon3]
One of the OSDs hit 90% utilization (the full_ratio), so reads and writes were blocked.
[root@osc-ceph01 ~]# ceph pg dump |grep -i full_ratio
dumped all in format plain
full_ratio 0.9
nearfull_ratio 0.8
[root@osc-ceph01 ~]# ceph daemon mon.`hostname` config show |grep -i osd_full_ratio
"mon_osd_full_ratio": "0.9",
[root@osc-ceph01 ~]#
02. The Issue
- Ceph community troubleshooting guide
Reference: http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-osd/#no-free-drive-space
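The guide's first step is to work out which OSDs are actually full. Not shown on the slide, but the usual read-only checks look like this (verify command availability against your release):
# Per-OSD utilization; highlights OSDs near the full_ratio
$ ceph osd df
# Per-pool usage summary
$ ceph df
# Lists near-full / full OSDs explicitly
$ ceph health detail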
02. The Issue
Ceph architecture
[Diagram: OSD nodes ceph-osd1–ceph-osd3 hosting osd.0–osd.8; monitors ceph-mon1, ceph-mon2, ceph-mon3]
pg 1.11f on osd.8 was deleted.
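The slide does not show how the PG copy was removed. For reference only, a hedged sketch of how a single PG copy is typically exported and deleted from one stopped filestore OSD with ceph-objectstore-tool (paths and flags are illustrative and vary by release; keeping the export is essential, because removing PG copies is exactly what triggers the trouble below):
# Stop the OSD so its object store can be opened offline
$ systemctl stop ceph-osd@8
# Keep a backup copy of the PG before touching anything
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-8 --pgid 1.11f --op export --file /root/pg-1.11f.export
# Remove this OSD's local copy of the PG (newer releases require --force)
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-8 --pgid 1.11f --op remove
$ systemctl start ceph-osd@8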
02. The Issue
[root@ceph-mon02 ~]# ceph -s
cluster f5078395-0236-47fd-ad02-8a6daadc7475
health HEALTH_ERR
1 pgs are stuck inactive for more than 300 seconds
162 pgs backfill_wait
37 pgs backfilling
322 pgs degraded
1 pgs down
2 pgs peering
4 pgs recovering
119 pgs recovery_wait
1 pgs stuck inactive
322 pgs stuck unclean
199 pgs undersized
recovery 592647/43243812 objects degraded (1.370%)
recovery 488046/43243812 objects misplaced (1.129%)
1 mons down, quorum 0,2 ceph-mon01,ceph-mon03
monmap e1: 3 mons at {ceph-mon01=10.10.50.201:6789/0,ceph-mon02=10.10.50.202:6789/0,ceph-mon03=10.10.50.203:6789/0}
election epoch 480, quorum 0,2 ceph-mon01,ceph-mon03
osdmap e27606: 128 osds: 125 up, 125 in; 198 remapped pgs
flags sortbitwise
pgmap v58287759: 10240 pgs, 4 pools, 54316 GB data, 14076 kobjects
157 TB used, 71440 GB / 227 TB avail
592647/43243812 objects degraded (1.370%)
488046/43243812 objects misplaced (1.129%)
9916 active+clean
162 active+undersized+degraded+remapped+wait_backfill
119 active+recovery_wait+degraded
37 active+undersized+degraded+remapped+backfilling
4 active+recovering+degraded
1 down+peering
1 peering
1 PG has been unreachable for more than 300 seconds (1.11f)...
162 PGs are waiting for backfill because their OSDs went down
37 PGs are backfilling because they fell outside the pglog range
322 PGs are degraded: they do not have all 3 copies, so performance drops
The 1 problematic down PG... (1.11f)
2 PGs are peering (deciding between recovery and backfill)
119 PGs are waiting for recovery
4 PGs are recovering from the pglog (I/O to those PGs is blocked)
1 PG is inactive because none of its OSDs is up (1.11f)
322 PGs are stuck unclean, short of full 3-way replication
199 PGs are undersized (below the pool's replica count)
1 monitor is down
The 3 OSDs holding pg 1.11f are down
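To narrow a HEALTH_ERR like this down to the offending PGs and OSDs, a few read-only queries are usually enough (hedged addition; the slide reaches the same conclusions from ceph -s and the notes above):
# Lists the failing health checks with the exact PG ids
$ ceph health detail
# PGs stuck inactive / unclean and the OSDs they map to
$ ceph pg dump_stuck inactive
$ ceph pg dump_stuck unclean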
02. The Issue
Architecture
[Diagram: OSD nodes ceph-osd1–ceph-osd3 hosting osd.0–osd.8; monitors ceph-mon1, ceph-mon2, ceph-mon3]
Images pool: holds the OpenStack images.
Volumes pool: pg 1.11f holds a small piece of every OpenStack volume.
If even one PG is down, none of the data in that pool can be used.
[root@osc-ceph01 ~]# ceph pg dump |head
dumped all in format plain
...
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary
last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
1.11f 0 0 0 0 0 0 3080 3080 active+clean 2019-07-10 08:12:46.623592 921'8580 10763:10591 [8,4,7] 8 [8,4,7] 8
921'8580 2019-07-10 08:12:46.623572 921'8580 2019-07-07 19:44:32.652377
...
The primary PG handles all I/O.
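To see which OSDs serve pg 1.11f and which of them is primary (the [8,4,7] / 8 columns in the dump above carry the same information), two read-only commands can be used:
# Up/acting OSD set and the primary for a single PG
$ ceph pg map 1.11f
# Detailed peering and recovery state of the PG (verbose JSON)
$ ceph pg 1.11f query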
Resolution
03
03. Resolution
writeout_from: 30174'649621, trimmed:
-1> 2018-10-24 15:28:44.487997 7fb622e2d700 5 write_log with: dirty_to: 0'0,
dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, divergent_priors: 0,
writeout_from: 30174'593316, trimmed:
0> 2018-10-24 15:28:44.502006 7fb61de23700 -1 osd/SnapMapper.cc: In function
'void SnapMapper::add_oid(const hobject_t&,
const std::set<snapid_t>&, MapCacher::Transaction<std::basic_string<char>, ceph::buffer::list>*)'
thread 7fb61de23700 time 2018-10-24 15:28:44.497739
osd/SnapMapper.cc: 228: FAILED assert(r == -2)
Analysis result...
The 3 replicated copies of the PG conflict with each other, and the OSDs holding that PG go down.
This is fixed in Red Hat Ceph 3.1 (Luminous), so upgrade!!
However...
- Red Hat OpenStack 9 (Mitaka) does not support Red Hat Ceph 3.1.
- Before upgrading to Red Hat Ceph 3.1, OpenStack must first be upgraded to 10 (Newton).
- Red Hat OpenStack 9 is deployed with TripleO (the upgrade process is very complicated...).
- The Red Hat Ceph upgrade would have to be done with the cluster in an error state.
- Argh...
03. Resolution
OpenStack upgrade
- Failed...
- Reinstalled, then recovered every VM
Ceph 3.1 upgrade
- Upgraded manually, without ceph-ansible
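The manual upgrade steps themselves are not on the slide. A rough, hedged outline of a hand-rolled Jewel-to-Luminous rolling upgrade (package names are illustrative; follow the official Red Hat Ceph 3.1 upgrade notes rather than this sketch):
# Keep CRUSH from rebalancing while daemons restart
$ ceph osd set noout
# Upgrade packages and restart monitors one at a time, then OSD nodes one at a time
$ yum update -y ceph-mon ceph-osd ceph-common     # package set is illustrative
$ systemctl restart ceph-mon@ceph-mon01
$ systemctl restart ceph-osd@8
# Luminous also needs ceph-mgr daemons (they show up in the later ceph -s output)
# Once every daemon runs Luminous:
$ ceph osd require-osd-release luminous
$ ceph osd unset noout
$ ceph versions     # all daemons should report the same release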
03. Resolution
[Diagram: OSD nodes ceph-osd1–ceph-osd3 hosting osd.0–osd.8]
vms pool: stores the VM images created by Nova
12345_disk: the existing VM's RBD image
67890_disk: the new VM's RBD image
New VM1, ID=67890
Existing VM1, ID=12345
Recovery procedure:
- Create a new VM (ID 67890)
- Delete the RBD image 67890_disk from the vms pool
- Rename 12345_disk to 67890_disk
- Repeat this for every VM (a scripted sketch follows the command output below)...
[root@ceph01 ~]# rbd list -p vms
12345_disk
67890_disk
[root@ceph01 ~]# rbd rm -p vms 67890_disk
Removing image: 100% complete...done.
[root@ceph01 ~]#
[root@ceph01 ~]# rbd mv -p vms 12345_disk 67890_disk
[root@ceph01 ~]# rbd ls -p vms
67890_disk
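A hedged sketch of scripting this rename across many VMs, assuming a hand-made mapping file (vm-id-map.txt, hypothetical) with one "old_id new_id" pair per line; verify each pair before deleting anything:
#!/bin/bash
# For each pair: drop the new VM's empty disk, then rename the old disk into its place
while read old new; do
  rbd rm -p vms "${new}_disk"
  rbd mv -p vms "${old}_disk" "${new}_disk"
done < vm-id-map.txt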
03. Resolution
After the Red Hat Ceph 3.1 upgrade...
- A similar problem occurred
- The OSDs holding pg 1.11f kept flapping up and down
[root@ceph-mon01 osc]# ceph -s
cluster:
id: f5078395-0236-47fd-ad02-8a6daadc7475
health: HEALTH_ERR
noscrub,nodeep-scrub flag(s) set
5 scrub errors
Possible data damage: 1 pg inconsistent
services:
mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
mgr: ceph-mon01(active), standbys: ceph-mon02, ceph-mon03
osd: 128 osds: 128 up, 128 in
flags noscrub,nodeep-scrub
data:
pools: 4 pools, 10240 pgs
objects: 12200k objects, 46865 GB
usage: 137 TB used, 97628 GB / 232 TB avail
pgs: 10239 active+clean
1 active+clean+inconsistent
io:
client: 0 B/s rd, 1232 kB/s wr, 19 op/s rd, 59 op/s wr
[root@ceph-mon01 osc]# ceph health detail
HEALTH_ERR noscrub,nodeep-scrub flag(s) set; 5 scrub errors; Possible data
damage: 1 pg inconsistent
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
OSD_SCRUB_ERRORS 5 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 1.11f is active+clean+inconsistent, acting [113,105,10]
[root@ceph-mon01 osc]#
OTL...
03. Resolution
This time, however, the problematic object could be pinpointed.
[root@ceph-mon01 ~]# rados list-inconsistent-obj 1.11f --format=json-pretty
{
"epoch": 34376,
"inconsistents": [
{
"object": {
"name": "rbd_data.39edab651c7b53.0000000000003600",
"nspace": "",
"locator": "",
03. Resolution
Object rbd_data.39edab651c7b53.0000000000003600 belonged to the root filesystem volume of a customer's DB service VM.
Fortunately the DB data itself was unaffected...
The RBD image holding that DB VM's root filesystem was deleted, but the cluster still reported HEALTH_ERR...
[root@ceph-mon01 osc]# ceph -s
cluster:
id: f5078395-0236-47fd-ad02-8a6daadc7475
health: HEALTH_ERR
4 scrub errors
Possible data damage: 1 pg inconsistent, 1 pg snaptrim_error
services:
mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
mgr: ceph-mon01(active), standbys: ceph-mon02, ceph-mon03
osd: 128 osds: 128 up, 128 in
data:
pools: 4 pools, 10240 pgs
objects: 12166k objects, 46731 GB
usage: 136 TB used, 98038 GB / 232 TB avail
pgs: 10239 active+clean
1 active+clean+inconsistent+snaptrim_error
io:
client: 0 B/s rd, 351 kB/s wr, 15 op/s rd, 51 op/s wr
[root@ceph-mon01 osc]# ceph health detail
HEALTH_ERR 4 scrub errors; Possible data damage: 1 pg inconsistent, 1 pg
snaptrim_error
OSD_SCRUB_ERRORS 4 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent, 1 pg snaptrim_error
pg 1.11f is active+clean+inconsistent+snaptrim_error, acting [113,105,10]
[root@ceph-mon01 osc]#
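For reference, the image deletion mentioned above was presumably along these lines (hedged sketch; <volume-uuid> is a placeholder, the real volume name is not shown, and deleting through Cinder is the cleaner path when the volume still exists there):
# Remove the image's snapshots first, then the image itself
# (protected snapshots must be unprotected, and clones flattened, beforehand)
$ rbd snap purge volumes/<volume-uuid>
$ rbd rm volumes/<volume-uuid>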
03. Resolution
- Snapshot id 54 (0x36) of the problematic object is what keeps triggering the error.
- ??? But we already deleted it??
2018-11-16 18:45:00.163319 7fb827aca700 -1 log_channel(cluster) log [ERR] : 1.11f shard 10: soid 1:f886c0a3:::rbd_data.39edab651c7b53.0000000000003600:36
data_digest 0x43d61c5d != data_digest 0x86baff34 from auth oi 1:f886c0a3::: rbd_data.39edab651c7b53.0000000000003600:36(14027'236814 osd.113.0:29524 [36]
dirty|data_digest|omap_digest s 4194304 uv 235954 dd 86baff34 od ffffffff alloc_hint [0 0 0])
2018-11-16 18:45:00.163330 7fb827aca700 -1 log_channel(cluster) log [ERR] : 1.11f shard 105: soid 1:f886c0a3:::rbd_data.39edab651c7b53.0000000000003600:36
data_digest 0x43d61c5d != data_digest 0x86baff34 from auth oi 1:f886c0a3::: rbd_data.39edab651c7b53.0000000000003600:36(14027'236814 osd.113.0:29524 [36]
dirty|data_digest|omap_digest s 4194304 uv 235954 dd 86baff34 od ffffffff alloc_hint [0 0 0])
2018-11-16 18:45:00.163333 7fb827aca700 -1 log_channel(cluster) log [ERR] : 1.11f shard 113: soid 1:f886c0a3:::rbd_data.39edab651c7b53.0000000000003600:36
data_digest 0x43d61c5d != data_digest 0x86baff34 from auth oi 1:f886c0a3::: rbd_data.39edab651c7b53.0000000000003600:36(14027'236814 osd.113.0:29524 [36]
dirty|data_digest|omap_digest s 4194304 uv 235954 dd 86baff34 od ffffffff alloc_hint [0 0 0])
$ printf "%d\n" 0x36
54
[root@ceph-osd08 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-113 --pgid 1.11f --op list | grep 39edab651c7b53
Error getting attr on : 1.11f_head,#-3:f8800000:::scrub_1.11f:head#, (61) No data available
["1.11f",{"oid":"rbd_data.39edab651c7b53.0000000000003600","key":"","snapid":54,"hash":3305333023,"max":0,"pool":1,"namespace":"","max":0}]
["1.11f",{"oid":"rbd_data.39edab651c7b53.0000000000003600","key":"","snapid":63,"hash":3305333023,"max":0,"pool":1,"namespace":"","max":0}]
["1.11f",{"oid":"rbd_data.39edab651c7b53.0000000000003600","key":"","snapid":-2,"hash":3305333023,"max":0,"pool":1,"namespace":"","max":0}]
03. Resolution
- Let's find the RBD image that owns the problematic object!
[root@ceph-mon01 osc]# rbd info volume-13076ffc-6520-4db8-b238-1ba6108bfe52 -p volumes
rbd image 'volume-13076ffc-6520-4db8-b238-1ba6108bfe52':
size 53248 MB in 13312 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.62cb510d494de
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
[root@ceph-mon01 osc]#
[root@ceph-mon01 osc]# cat rbd-info.sh
#!/bin/bash
for i in `rbd list -p volumes`
do
rbd info volumes/$i |grep rbd_data.39edab651c7b53
echo --- $i done ----
done
[root@ceph-mon01 osc]# bash rbd-info.sh
rbd info shows the object name prefix of each image.
A script that searches every RBD image for the problematic object prefix.
[root@ceph-mon01 osc]# bash rbd-info.sh
--- rbdtest done ----
--- volume-00b0de1a-bfab-40e0-a444-b6c2d0de3905 done ----
--- volume-02d9c884-fc30-4700-87fd-950855ae361d done ----
...
[root@ceph-mon01 osc]#
The result... as expected, nothing...
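The prefix could in principle belong to an image in another pool, so a hedged extension of the same idea searches every pool instead of only volumes (here it would also come back empty, since the image had already been deleted):
#!/bin/bash
# Search every pool for the image whose block_name_prefix matches
for pool in $(ceph osd pool ls); do
  for img in $(rbd ls -p "$pool"); do
    rbd info "$pool/$img" | grep -q rbd_data.39edab651c7b53 && echo "found: $pool/$img"
  done
done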
03. Resolution
- The volume that held that snapshot has been deleted, so the condition causing the error no longer exists.
- We were advised to run repair again.
[root@ceph-mon01 ~]# date ; ceph pg repair 1.11f
Wed Nov 28 18:16:25 KST 2018
instructing pg 1.11f on osd.113 to repair
[root@ceph-mon01 ~]# ceph health detail
HEALTH_ERR noscrub,nodeep-scrub flag(s) set; Possible data damage: 1 pg repair
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
PG_DAMAGED Possible data damage: 1 pg repair
pg 1.11f is active+clean+scrubbing+deep+repair, acting [113,105,10]
[root@ceph-mon01 ~]# ceph -s
cluster:
id: f5078395-0236-47fd-ad02-8a6daadc7475
health: HEALTH_ERR
noscrub,nodeep-scrub flag(s) set
Possible data damage: 1 pg repair
services:
mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
mgr: ceph-mon01(active), standbys: ceph-mon02, ceph-mon03
osd: 128 osds: 128 up, 128 in
flags noscrub,nodeep-scrub
data:
pools: 4 pools, 10240 pgs
objects: 12321k objects, 47365 GB
usage: 138 TB used, 96138 GB / 232 TB avail
pgs: 10239 active+clean
1 active+clean+scrubbing+deep+repair
io:
client: 598 kB/s rd, 1145 kB/s wr, 18 op/s rd, 63 op/s wr
pg 1.11f is being repaired
03. Resolution
- Checking the Ceph log.
[root@ceph-mon01 ~]# ceph -w
...
2018-11-28 18:21:26.654955 osd.113 [ERR] 1.11f repair stat mismatch, got 3310/3312 objects, 91/92 clones, 3243/3245
dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 67/68 whiteouts, 13579894784/13584089088 bytes, 0/0 hit_set_archive bytes.
2018-11-28 18:21:26.655657 osd.113 [ERR] 1.11f repair 1 errors, 1 fixed
2018-11-28 18:19:28.979704 mon.ceph-mon01 [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg repair)
2018-11-28 18:20:30.652593 mon.ceph-mon01 [WRN] Health check update: nodeep-scrub flag(s) set (OSDMAP_FLAGS)
2018-11-28 18:20:35.394445 mon.ceph-mon01 [INF] Health check cleared: OSDMAP_FLAGS (was: nodeep-scrub flag(s) set)
2018-11-28 18:20:35.394457 mon.ceph-mon01 [INF] Cluster is now healthy
Huh..?! Fixed???
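The OSDMAP_FLAGS messages above show the scrub flags being cleared around the same time; presumably they had been set earlier during troubleshooting and were removed with the usual commands:
# Re-enable scrubbing once the repair is done
$ ceph osd unset noscrub
$ ceph osd unset nodeep-scrub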
03. Resolution
- HEALTH_OK
[root@ceph-mon01 ~]# ceph -s
cluster:
id: f5078395-0236-47fd-ad02-8a6daadc7475
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
mgr: ceph-mon01(active), standbys: ceph-mon02, ceph-mon03
osd: 128 osds: 128 up, 128 in
data:
pools: 4 pools, 10240 pgs
objects: 12321k objects, 47366 GB
usage: 138 TB used, 96138 GB / 232 TB avail
pgs: 10216 active+clean
24 active+clean+scrubbing+deep
io:
client: 424 kB/s rd, 766 kB/s wr, 18 op/s rd, 72 op/s wr
Q&A
Youngju Lee, Open Source Consulting
( yjlee@osci.kr )
Thank you
감사합니다
Cloud & Collaboration
T. 02-516-0711 E. sales@osci.kr
5F, 32 Teheran-ro 83-gil, Gangnam-gu, Seoul (Samseong-dong, Narakium Samseong-dong A Bldg.)
www.osci.kr
