-
hunter86_bg
- Posts: 2019
- Joined: 2015/02/17 15:14:33
- Location: Bulgaria
Post by hunter86_bg » 2019/06/01 23:58:18
Failed Actions:
* ilocsn2_start_0 on csn1 'unknown error' (1): call=49, status=Timed Out, exitreason='',
last-rc-change='Sun May 26 03:23:37 2019', queued=0ms, exec=60303ms
* ilocsn1_monitor_60000 on csn1 'unknown error' (1): call=34, status=Timed Out, exitreason='',
last-rc-change='Sun May 26 03:24:32 2019', queued=0ms, exec=20004ms
* fs_start_0 on csn2 'not installed' (5): call=31, status=complete, exitreason='Couldn't find device [/dev/clustered_vg/kvm-gfs]. Expected /dev/??? to exist',
last-rc-change='Sun May 26 03:29:49 2019', queued=0ms, exec=89ms
You have issues with your iLOs. That's why SBD was offered.
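Since the timeouts are on the fence resources themselves, it may be worth testing the iLO path by hand. A minimal check, with placeholder address and credentials (run from the peer node):
Code:
# ask the fence agent for power status (substitute your real iLO address/user/password)
fence_ilo3 -a <ilo-address-of-csn2> -l <user> -p <password> -o status
# end-to-end test - this really reboots the target, so only on an idle cluster:
pcs stonith fence csn2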
Disabling lvmetad is very important - no more LVM metadata caching.
You still haven't done that.
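For reference, on EL7 both LVM changes can be made in one step; a minimal sketch, assuming the stock lvm2 tooling (verify the result in /etc/lvm/lvm.conf):
Code:
# switch LVM to clustered locking (locking_type = 3) and stop using lvmetad
lvmconf --enable-cluster
# make sure the metadata-caching daemon is really off
systemctl disable --now lvm2-lvmetad.service lvm2-lvmetad.socket
# confirm both settings took effect
grep -E 'locking_type|use_lvmetad' /etc/lvm/lvm.conf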
-
jeffinto
- Posts: 9
- Joined: 2019/05/09 19:52:32
Post by jeffinto » 2019/06/02 00:37:37
I had rebooted one node for something else, and that's when those failures were recorded. This newer pcs status, taken after updating and rebooting both nodes, shows it's working.
Code:
# pcs status
Cluster name: cluster_core
Stack: corosync
Current DC: csn2 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Sat Jun 1 20:23:13 2019
Last change: Fri May 31 22:47:33 2019 by root via cibadmin on csn1
2 nodes configured
8 resources configured
Online: [ csn1 csn2 ]
Full list of resources:
Clone Set: dlm-clone [dlm]
Started: [ csn1 csn2 ]
Clone Set: clvmd-clone [clvmd]
Started: [ csn1 csn2 ]
ilocsn1 (stonith:fence_ilo3): Started csn2
ilocsn2 (stonith:fence_ilo3): Started csn1
Clone Set: fs-clone [fs]
Started: [ csn2 ]
Stopped: [ csn1 ]
Failed Actions:
* fs_start_0 on csn1-gfs 'not installed' (5): call=31, status=complete, exitreason='Couldn't find device [/dev/clustered_vg/kvm_gfs]. Expected /dev/??? to exist',
last-rc-change='Fri May 31 22:59:55 2019', queued=0ms, exec=90ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
And I have set use_lvmetad = 0 in lvm.conf. I did that at the same time I set locking_type = 3.
I appreciate the suggestions. Anything else you can think of to check?
-
aks
- Posts: 3073
- Joined: 2014/09/20 11:22:14
Post by aks » 2019/06/02 16:50:02
Anything else you can think of to check?
Uh yeah. Write to both SCSI targets at the same time (or as close as you can). If you don't end up with a steaming pile of poo, then it's probably okay (for at least the types of writes you have issued).
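A crude version of that test, once the GFS2 filesystem is mounted on both nodes, is to write node-tagged files into the shared mount from each side at the same time and read them back from the other (file names are made up; the mount point is the one from your cluster config):
Code:
# run on csn1 and csn2 at roughly the same moment:
echo "written by $(hostname) at $(date)" > /var/lib/libvirt/images/write-test.$(hostname)
sync
# then, from either node, both files should exist with the right contents:
cat /var/lib/libvirt/images/write-test.*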
-
jeffinto
- Posts: 9
- Joined: 2019/05/09 19:52:32
Post by jeffinto » 2019/06/11 01:56:10
The hardware is verified to allow both machines to read and write to each disk at the same time.
The cluster has one corosync ring
I've verified in /var/log/messages that LVM is locking through corosync
STONITH fencing via iLO3 is verified to work
lvm.conf has locking_type = 3 and lvmetad is disabled
Yet on whichever node boots second, I still get "Error locking on node: Volume is busy on another node". There doesn't seem to be anything configurable about how DLM handles cLVM locks, so I'm still totally lost on how to fix this.
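When chasing that message, it can help to compare what DLM and LVM report on each node; a quick look with the standard EL7 tools (run on both nodes):
Code:
dlm_tool ls                           # the clvmd lockspace should list both members
vgs -o vg_name,vg_attr                # a 'c' in the attr string marks the VG as clustered
lvs -o lv_name,lv_attr clustered_vg   # an 'a' in the attr string means the LV is active on this node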
-
hunter86_bg
- Posts: 2019
- Joined: 2015/02/17 15:14:33
- Location: Bulgaria
Post by hunter86_bg » 2019/06/11 18:38:09
The main question is why this one is happening:
fs_start_0 on csn2 'not installed' (5): call=31, status=complete, exitreason='Couldn't find device [/dev/clustered_vg/kvm-gfs]. Expected /dev/??? to exist',
So, are your LVs active on both nodes?
The error indicates an inactive VG.
Can you provide your cluster configuration? 'pcs config' can help with that.
-
jeffinto
- Posts: 9
- Joined: 2019/05/09 19:52:32
Post by jeffinto » 2019/06/12 11:14:56
There is indeed an inactive VG. From the node having issues I can vgscan and see the volume.
vgchange -a y results in the error that the volume is busy on another node and it cannot get a lock.
Code:
# pcs config
Cluster Name: cluster_core
Corosync Nodes:
csn1 csn2
Pacemaker Nodes:
csn1 csn2
Resources:
Clone: dlm-clone
Meta Attrs: interleave=true ordered=true
Resource: dlm (class=ocf provider=pacemaker type=controld)
Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
start interval=0s timeout=90 (dlm-start-interval-0s)
stop interval=0s timeout=100 (dlm-stop-interval-0s)
Clone: clvmd-clone
Meta Attrs: interleave=true ordered=true with_cmirrord=true
Resource: clvmd (class=ocf provider=heartbeat type=clvm)
Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
start interval=0s timeout=90s (clvmd-start-interval-0s)
stop interval=0s timeout=90s (clvmd-stop-interval-0s)
Clone: fs-clone
Resource: fs (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/clustered_vg/kvm_gfs directory=/var/lib/libvirt/images fstype=gfs2
Operations: monitor interval=20s timeout=40s (fs-monitor-interval-20s)
notify interval=0s timeout=60s (fs-notify-interval-0s)
start interval=0s timeout=60s (fs-start-interval-0s)
stop interval=0s timeout=60s (fs-stop-interval-0s)
Stonith Devices:
Resource: csn1 (class=stonith type=fence_ilo3)
Attributes: ipaddr=csn1 login=* passwd=* pcmk_host_list=csn1
Operations: monitor interval=60s (csn1-monitor-interval-60s)
Resource: csn2 (class=stonith type=fence_ilo3)
Attributes: ipaddr=csn2 login=* passwd=* pcmk_host_list=csn2
Operations: monitor interval=60s (csn2-monitor-interval-60s)
Fencing Levels:
Location Constraints:
Ordering Constraints:
start dlm-clone then start clvmd-clone (kind:Mandatory)
start clvmd-clone then start fs-clone (kind:Mandatory)
Colocation Constraints:
clvmd-clone with dlm-clone (score:INFINITY)
fs-clone with clvmd-clone (score:INFINITY)
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
No defaults set
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: cluster_core
dc-version: 1.1.19-8.el7_6.4-c3c624ea3d
have-watchdog: false
last-lrm-refresh: 1560201528
no-quorum-policy: freeze
stonith-enabled: true
Quorum:
Options:
wait_for_all: 0
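For reference, the ordering and colocation section above maps onto pcs commands along these lines (a sketch using the same resource names):
Code:
pcs constraint order start dlm-clone then start clvmd-clone
pcs constraint order start clvmd-clone then start fs-clone
pcs constraint colocation add clvmd-clone with dlm-clone
pcs constraint colocation add fs-clone with clvmd-clone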
-
hunter86_bg
- Posts: 2019
- Joined: 2015/02/17 15:14:33
- Location: Bulgaria
Post by hunter86_bg » 2019/06/14 14:13:05
clvmd connects to the local dlm, and dlm connects to its peer on the other node.
Keep in mind that these processes also have systemd units.
Check their status, and also check whether the dlm and clvmd daemons are actually running.
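A quick way to check both at once, assuming the stock EL7 unit names dlm and clvmd (when Pacemaker drives them through resource agents, the units themselves normally stay disabled):
Code:
systemctl status dlm clvmd        # unit state on each node
systemctl is-enabled dlm clvmd    # usually disabled when cluster-managed
ps -e | grep -E 'dlm|clvmd'       # the daemons the resource agents actually started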
-
jeffinto
- Posts: 9
- Joined: 2019/05/09 19:52:32
Post by jeffinto » 2019/06/14 15:59:21
Hi hunter86,
I'm sure those services are running; killing the process on either node causes that node to get reset by STONITH.
Code:
[root@csn1 ~]# ps -e|grep dlm
8322 ? 00:00:01 dlm_controld
8485 ? 00:00:07 dlm_scand
8486 ? 00:00:00 dlm_recv
8487 ? 00:00:00 dlm_send
8488 ? 00:00:00 dlm_recoverd
10114 ? 00:00:00 dlm_callback
10115 ? 00:00:00 dlm_recoverd
[root@csn1 ~]# ps -e|grep clvmd
8484 ? 00:00:00 clvmd
Code:
[root@csn2 ~]# ps -e|grep dlm
8333 ? 00:00:01 dlm_controld
8489 ? 00:00:05 dlm_scand
8490 ? 00:00:00 dlm_recv
8491 ? 00:00:00 dlm_send
8492 ? 00:00:00 dlm_recoverd
[root@csn2 ~]# ps -e|grep clvmd
8488 ? 00:00:00 clvmd
I can also see the connections between LVM and DLM being made in the logs.
Code:
Jun 10 16:18:40 csn1 pengine[8057]: notice: * Start clvmd:0 ( csn1 )
Jun 10 16:18:40 csn1 crmd[8058]: notice: Initiating monitor operation clvmd:0_monitor_0 locally on csn1
Jun 10 16:18:40 csn1 clvm(clvmd)[8165]: INFO: clvmd is not running
Jun 10 16:18:41 csn1 crmd[8058]: notice: Result of probe operation for clvmd on csn1: 7 (not running)
Jun 10 16:18:51 csn1 crmd[8058]: notice: Initiating start operation clvmd:0_start_0 locally on csn1
Jun 10 16:18:51 csn1 clvm(clvmd)[8347]: INFO: clvmd is not running
Jun 10 16:18:51 csn1 clvm(clvmd)[8347]: INFO: clvmd is not running
Jun 10 16:18:51 csn1 clvm(clvmd)[8347]: INFO: Starting /usr/sbin/clvmd:
Jun 10 16:18:53 csn1 clvmd: Cluster LVM daemon started - connected to Corosync
Jun 10 16:18:53 csn1 lvm[8673]: Monitoring RAID device clustered_vg-kvm_gfs for events.
Jun 10 16:18:53 csn1 clvm(clvmd)[8347]: INFO: 1 logical volume(s) in volume group "clustered_vg" now active
Jun 10 16:18:53 csn1 clvm(clvmd)[8347]: INFO: PID file (pid:8484 at /var/run/resource-agents/clvmd-clvmd.pid) created for clvmd.
Jun 10 16:18:53 csn1 crmd[8058]: notice: Result of start operation for clvmd on csn1: 0 (ok)
Jun 10 16:18:53 csn1 crmd[8058]: notice: Initiating monitor operation clvmd:0_monitor_30000 locally on csn1
Jun 10 16:23:16 csn1 pengine[8057]: notice: * Start clvmd:1 ( csn2 )
Jun 10 16:23:16 csn1 pengine[8057]: notice: * Restart fs:0 ( csn1 ) due to required clvmd-clone running
Jun 10 16:23:16 csn1 crmd[8058]: notice: Initiating monitor operation clvmd:1_monitor_0 on csn2
Jun 10 16:23:18 csn1 crmd[8058]: notice: Initiating start operation clvmd:1_start_0 on csn2
Jun 10 16:23:20 csn1 crmd[8058]: notice: Initiating monitor operation clvmd:1_monitor_30000 on csn2
I'm not sure whether the clvmd-clone is the issue, but the log does show it monitoring the other node's clvmd.
-
hunter86_bg
- Posts: 2019
- Joined: 2015/02/17 15:14:33
- Location: Bulgaria
Post by hunter86_bg » 2019/06/15 19:09:20
I will try to reproduce your issue in my lab and will post my results.
-
hunter86_bg
- Posts: 2019
- Joined: 2015/02/17 15:14:33
- Location: Bulgaria
Post by hunter86_bg » 2019/06/20 15:37:39
I have created a two-node cluster with GFS2, and it seems I cannot reproduce the issue.
I will try to write up all the steps I have done so far, to help you find where your setup differs.
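Until then, the storage-side part of the usual EL7 recipe for a clustered GFS2 volume looks roughly like this (a sketch; /dev/sdX stands in for the shared LUN, and the size is illustrative):
Code:
# with dlm-clone and clvmd-clone already running on both nodes:
vgcreate -Ay -cy clustered_vg /dev/sdX
lvcreate -L 100G -n kvm_gfs clustered_vg
# 2 journals for 2 nodes; the -t value is clustername:fsname
mkfs.gfs2 -p lock_dlm -j 2 -t cluster_core:kvm_gfs /dev/clustered_vg/kvm_gfs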