cLVM is driving me nuts

Issues related to applications and software problems
hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria
Contact:

Re: cLVM is driving me nuts

Post by hunter86_bg » 2019/06/01 23:58:18

Failed Actions:
* ilocsn2_start_0 on csn1 'unknown error' (1): call=49, status=Timed Out, exitreason='',
last-rc-change='Sun May 26 03:23:37 2019', queued=0ms, exec=60303ms
* ilocsn1_monitor_60000 on csn1 'unknown error' (1): call=34, status=Timed Out, exitreason='',
last-rc-change='Sun May 26 03:24:32 2019', queued=0ms, exec=20004ms
* fs_start_0 on csn2 'not installed' (5): call=31, status=complete, exitreason='Couldn't find device [/dev/clustered_vg/kvm-gfs]. Expected /dev/??? to exist',
last-rc-change='Sun May 26 03:29:49 2019', queued=0ms, exec=89ms
You have issues with your iLOs. That's why SBD was offered.

Disabling lvmetad is very important - no more LVM metadata caching.
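For reference, on EL7 the usual way to switch LVM to clustered locking and drop lvmetad in one go is roughly this (a sketch assuming the stock lvm2/lvm2-cluster packages; adjust for your setup):

Code: Select all

# Sets locking_type = 3 and use_lvmetad = 0 in /etc/lvm/lvm.conf
lvmconf --enable-cluster
# Stop the metadata caching daemon and keep it from coming back at boot
systemctl stop lvm2-lvmetad.service lvm2-lvmetad.socket
systemctl disable lvm2-lvmetad.service lvm2-lvmetad.socket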

You still haven't

jeffinto
Posts: 9
Joined: 2019/05/09 19:52:32

Re: cLVM is driving me nuts

Post by jeffinto » 2019/06/02 00:37:37

I had rebooted one node for something else and that is what got recorded. This newer pcs status, taken after updating and rebooting both nodes, shows it working.

Code: Select all

# pcs status
Cluster name: cluster_core
Stack: corosync
Current DC: csn2 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Sat Jun  1 20:23:13 2019
Last change: Fri May 31 22:47:33 2019 by root via cibadmin on csn1

2 nodes configured
8 resources configured

Online: [ csn1 csn2 ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ csn1 csn2 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ csn1 csn2 ]
 ilocsn1	(stonith:fence_ilo3):	Started csn2
 ilocsn2	(stonith:fence_ilo3):	Started csn1
 Clone Set: fs-clone [fs]
     Started: [ csn2 ]
     Stopped: [ csn1 ]

Failed Actions:
* fs_start_0 on csn1-gfs 'not installed' (5): call=31, status=complete, exitreason='Couldn't find device [/dev/clustered_vg/kvm_gfs]. Expected /dev/??? to exist',
    last-rc-change='Fri May 31 22:59:55 2019', queued=0ms, exec=90ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
And I have set use_lvmetad = 0 in lvm.conf. I did that at the same time I set locking_type = 3.
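For anyone checking the same thing, a quick way to confirm the values LVM is actually using (lvmconfig ships with the lvm2 package on EL7; older builds have the equivalent lvm dumpconfig):

Code: Select all

# Print the effective settings, including anything overridden in lvm.conf
lvmconfig global/locking_type global/use_lvmetad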

I appreciate the suggestions. Anything else you can think of to check?

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: cLVM is driving me nuts

Post by aks » 2019/06/02 16:50:02

Anything else you can think of to check?
Uh yeah. Write to both SCSI targets at the same time (or as close as you can). If you don't end up with a steaming pile of poo, then it's probably okay (for at least the types of writes you have issued).
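A rough sketch of such a test, against a hypothetical scratch LUN /dev/sdX (the device name and offsets are placeholders - do not aim this at the PV that is in use):

Code: Select all

# On csn1: write a known 1 MiB block straight to the shared device
dd if=/dev/urandom of=/tmp/probe1 bs=1M count=1
md5sum /tmp/probe1
dd if=/tmp/probe1 of=/dev/sdX bs=1M seek=100 oflag=direct conv=fsync

# On csn2, at the same time: write a different block one MiB further in
dd if=/dev/urandom of=/tmp/probe2 bs=1M count=1
md5sum /tmp/probe2
dd if=/tmp/probe2 of=/dev/sdX bs=1M seek=101 oflag=direct conv=fsync

# Afterwards, read both blocks back from each node and compare the md5sums
dd if=/dev/sdX bs=1M skip=100 count=1 iflag=direct | md5sum
dd if=/dev/sdX bs=1M skip=101 count=1 iflag=direct | md5sum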

jeffinto
Posts: 9
Joined: 2019/05/09 19:52:32

Re: cLVM is driving me nuts

Post by jeffinto » 2019/06/11 01:56:10

The hardware is verified to allow both machines to read and write to each disk at the same time.

The cluster has one corosync ring.

I've verified in /var/log/messages that LVM is locking through corosync.

The STONITH fencing via iLO3 is verified to work.

lvm.conf has locking_type = 3 and lvmetad is disabled.

Yet on whichever node boots second I still get the error "Error locking on node: Volume is busy on another node". There doesn't seem to be anything configurable about how DLM handles cLVM locks, so I'm still totally lost on how to fix this.
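For reference, the checks for the ring and the DLM side amount to something like this (dlm_tool comes with the dlm package; the clvmd lockspace only shows up once clvmd is running):

Code: Select all

# Corosync ring health
corosync-cfgtool -s

# DLM lockspaces known on this node
dlm_tool ls

# Membership and fencing state as seen by dlm_controld
dlm_tool status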

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria
Contact:

Re: cLVM is driving me nuts

Post by hunter86_bg » 2019/06/11 18:38:09

The main question is why this one is happening:
fs_start_0 on csn2 'not installed' (5): call=31, status=complete, exitreason='Couldn't find device [/dev/clustered_vg/kvm-gfs]. Expected /dev/??? to exist',
So, are your LVs active on both nodes?
The error indicates an inactive VG.
Can you provide your cluster configuration?
This one can help:

Code: Select all

pcs config
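Something like this on each node will also show the activation state directly (the fifth character of the LV attr string is "a" when the LV is active on that node; the VG name is taken from your earlier output):

Code: Select all

vgs clustered_vg
lvs -o lv_name,lv_attr,vg_name clustered_vg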

jeffinto
Posts: 9
Joined: 2019/05/09 19:52:32

Re: cLVM is driving me nuts

Post by jeffinto » 2019/06/12 11:14:56

There is indeed an inactive VG. From the node having issues I can run vgscan and see the volume group.

vgchange -a y results in the error that the volume is busy on another node and it cannot get a lock.

Code: Select all

# pcs config
Cluster Name: cluster_core
Corosync Nodes:
 csn1 csn2
Pacemaker Nodes:
 csn1 csn2

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true 
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true with_cmirrord=true 
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90s (clvmd-start-interval-0s)
               stop interval=0s timeout=90s (clvmd-stop-interval-0s)
 Clone: fs-clone
  Resource: fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/clustered_vg/kvm_gfs directory=/var/lib/libvirt/images fstype=gfs2
   Operations: monitor interval=20s timeout=40s (fs-monitor-interval-20s)
               notify interval=0s timeout=60s (fs-notify-interval-0s)
               start interval=0s timeout=60s (fs-start-interval-0s)
               stop interval=0s timeout=60s (fs-stop-interval-0s)

Stonith Devices:
 Resource: csn1 (class=stonith type=fence_ilo3)
  Attributes: ipaddr=csn1 login=* passwd=* pcmk_host_list=csn1
  Operations: monitor interval=60s (csn1-monitor-interval-60s)
 Resource: csn2 (class=stonith type=fence_ilo3)
  Attributes: ipaddr=csn2 login=* passwd=* pcmk_host_list=csn2
  Operations: monitor interval=60s (csn2-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
  start clvmd-clone then start fs-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
  fs-clone with clvmd-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: cluster_core
 dc-version: 1.1.19-8.el7_6.4-c3c624ea3d
 have-watchdog: false
 last-lrm-refresh: 1560201528
 no-quorum-policy: freeze
 stonith-enabled: true

Quorum:
  Options:
    wait_for_all: 0

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria
Contact:

Re: cLVM is driving me nuts

Post by hunter86_bg » 2019/06/14 14:13:05

clvmd talks to dlm, and dlm communicates with its peer on the other node.
Keep in mind that these daemons also have systemd units.

Check their status, and also check whether dlm and clvmd are actually running.
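For example (unit names as shipped on EL7 by the dlm and lvm2-cluster packages; note that when the daemons are managed by pacemaker the systemd units themselves may legitimately show as inactive):

Code: Select all

systemctl status dlm.service clvmd.service
pgrep -a dlm_controld
pgrep -a clvmd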

jeffinto
Posts: 9
Joined: 2019/05/09 19:52:32

Re: cLVM is driving me nuts

Post by jeffinto » 2019/06/14 15:59:21

Hi hunter86,

I'm sure those services are running. Killing the process on either node causes that node to get reset by STONITH.

Code: Select all

[root@csn1 ~]# ps -e|grep dlm
 8322 ?        00:00:01 dlm_controld
 8485 ?        00:00:07 dlm_scand
 8486 ?        00:00:00 dlm_recv
 8487 ?        00:00:00 dlm_send
 8488 ?        00:00:00 dlm_recoverd
10114 ?        00:00:00 dlm_callback
10115 ?        00:00:00 dlm_recoverd
[root@csn1 ~]# ps -e|grep clvmd
 8484 ?        00:00:00 clvmd

Code: Select all

[root@csn2 ~]# ps -e|grep dlm
 8333 ?        00:00:01 dlm_controld
 8489 ?        00:00:05 dlm_scand
 8490 ?        00:00:00 dlm_recv
 8491 ?        00:00:00 dlm_send
 8492 ?        00:00:00 dlm_recoverd
[root@csn2 ~]# ps -e|grep clvmd
 8488 ?        00:00:00 clvmd
I can also see the connection between LVM and DLM being made in the logs.

Code: Select all

Jun 10 16:18:40 csn1 pengine[8057]:  notice:  * Start      clvmd:0     ( csn1 )
Jun 10 16:18:40 csn1 crmd[8058]:  notice: Initiating monitor operation clvmd:0_monitor_0 locally on csn1
Jun 10 16:18:40 csn1 clvm(clvmd)[8165]: INFO: clvmd is not running
Jun 10 16:18:41 csn1 crmd[8058]:  notice: Result of probe operation for clvmd on csn1: 7 (not running)
Jun 10 16:18:51 csn1 crmd[8058]:  notice: Initiating start operation clvmd:0_start_0 locally on csn1
Jun 10 16:18:51 csn1 clvm(clvmd)[8347]: INFO: clvmd is not running
Jun 10 16:18:51 csn1 clvm(clvmd)[8347]: INFO: clvmd is not running
Jun 10 16:18:51 csn1 clvm(clvmd)[8347]: INFO: Starting /usr/sbin/clvmd:
Jun 10 16:18:53 csn1 clvmd: Cluster LVM daemon started - connected to Corosync
Jun 10 16:18:53 csn1 lvm[8673]: Monitoring RAID device clustered_vg-kvm_gfs for events.
Jun 10 16:18:53 csn1 clvm(clvmd)[8347]: INFO:  1 logical volume(s) in volume group "clustered_vg" now active
Jun 10 16:18:53 csn1 clvm(clvmd)[8347]: INFO: PID file (pid:8484 at /var/run/resource-agents/clvmd-clvmd.pid) created for clvmd.
Jun 10 16:18:53 csn1 crmd[8058]:  notice: Result of start operation for clvmd on csn1: 0 (ok)
Jun 10 16:18:53 csn1 crmd[8058]:  notice: Initiating monitor operation clvmd:0_monitor_30000 locally on csn1
Jun 10 16:23:16 csn1 pengine[8057]:  notice:  * Start      clvmd:1     ( csn2 )
Jun 10 16:23:16 csn1 pengine[8057]:  notice:  * Restart    fs:0        (             csn1 )   due to required clvmd-clone running
Jun 10 16:23:16 csn1 crmd[8058]:  notice: Initiating monitor operation clvmd:1_monitor_0 on csn2
Jun 10 16:23:18 csn1 crmd[8058]:  notice: Initiating start operation clvmd:1_start_0 on csn2
Jun 10 16:23:20 csn1 crmd[8058]:  notice: Initiating monitor operation clvmd:1_monitor_30000 on csn2
I'm not sure whether the "clvmd-clone" is the issue, but it does say it's monitoring the other node's clvmd.
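One more way to see whether both clone instances are really up on each node is a one-shot crm_mon (part of pacemaker), e.g.:

Code: Select all

# One-shot status, including inactive resources
crm_mon -1r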

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria
Contact:

Re: cLVM is driving me nuts

Post by hunter86_bg » 2019/06/15 19:09:20

I will try to reproduce your issue in my lab and will post my results.

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria
Contact:

Re: cLVM is driving me nuts

Post by hunter86_bg » 2019/06/20 15:37:39

I have created a two-node cluster with GFS2 and it seems that I cannot reproduce the issue.
I will try to provide all the steps I have taken so far, in order to help you find the issue.
