[Tutorial] Software RAID fails on reboot and leads to emergency mode

alltheasimov
Posts: 1
Joined: 2017/12/02 22:06:53

[Tutorial] Software RAID fails on reboot and leads to emergency mode

Post by alltheasimov » 2017/12/02 23:02:42

Hi,

I'm new to CentOS, but I'd like to share what I've learned about creating RAID arrays with mdadm and CentOS 7.

The original problem I was having:
I followed the normal online guides for creating a RAID 1 volume from two 3TB hdds that I plan to use to store data (not boot from). Link 1, Link 2. Following those guides, I successfully created a RAID 1 array and was able to write data to it. I created the mdadm.conf file, and added the correct line to fstab. I rebooted, but emergency mode launched. (I guess no sort of GUI is startable from emergency mode?) At first I thought it might be the nvidia drivers I installed from the elrepo project. Messing around with that didn't yield anything, so I ended up reinstalling CentOS. I went through the same process of doing updates, building the RAID array, etc. Rebooted, and the same thing happened.

This time I combed through the ~4000-line log file. I had set up NFS from a different drive, so I saw a lot of errors related to that not starting. I thought that might have something to do with it, so I tried manually starting a lot of those processes, but they all failed. Back to the log file... before the NFS errors was a line that said the md0 RAID had failed to build, or something like that. It wasn't highlighted as an error, but I thought: aha! I went into fstab, commented out the RAID line, rebooted, and it booted! I guess that failure caused a cascade of failures that eventually resulted in emergency mode being triggered.

However, there was no sign of the previously working RAID array... no recognized superblocks, nothing, so I couldn't assemble or start it again. I spent about 16 hours troubleshooting all of this and FINALLY came up with a solution.

Here are the steps to follow to create a software RAID array with mdadm in CentOS 7 that won't cause boot to fail. Do not leave any of these out.
  1. Run lsblk to figure out which disks you want to RAID.
  2. If you already have a raid running that you want to kill and remake, follow this guide.
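    In case that link dies, the usual teardown sequence looks roughly like this (a sketch; adjust device and mount names to match your setup, and remove the array's lines from /etc/fstab and /etc/mdadm.conf as well):

    Code: Select all

    sudo umount /data                                 # unmount the filesystem
    sudo mdadm --stop /dev/md0                        # stop the running array
    sudo mdadm --zero-superblock /dev/sda1 /dev/sdb1  # wipe the RAID metadata on the members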
  3. Delete partitions on the disks using, for example, the disk utility. You can skip this if the disks are new.
  4. Zero out the old metadata. Run the command below, where sda is the disk you want to overwrite with zeros; with bs=1M, count is the number of megabytes to write (count=500000 is 500GB, but the exact number doesn't matter much because you can Ctrl+C it after a few minutes...usually you just need to wipe the first few sectors). The goal is to wipe out any metadata (filesystems, etc...) left on the disk from previous fails. You can skip this if the disks are new.

    Code: Select all

     sudo dd if=/dev/zero of=/dev/sda bs=1M count=500000
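    A more targeted alternative (assuming wipefs from util-linux is available, which it should be on a stock CentOS 7 install) is to erase just the known signatures instead of zeroing hundreds of gigabytes:

    Code: Select all

    sudo wipefs -a /dev/sda    # erase all filesystem/RAID/partition-table signatures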
  5. Contents in the disk utility should say “unknown” now. If it says free space or unallocated space, go back and run the zeroing command longer.
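    You can also verify from the command line; the FSTYPE column should come back empty for a wiped disk:

    Code: Select all

    lsblk -f /dev/sda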
  6. Now create the partitions. There are pros and cons to building the RAID from partitions versus whole disks; I used partitions. Guide
  7. Open the first disk in parted and check it:

    Code: Select all

    sudo parted /dev/sda
    (parted) print
  8. If these are blank drives, there should be no partitions. If a partition exists, go back and delete it. If a filesystem exists, go back and re-zero.

    Code: Select all

    (parted) mklabel gpt 
    (parted) print
    (parted) mkpart primary 0% 100%
    (parted) set 1 raid on 
    (parted) align-check optimal 1 
    (parted) print 
    (parted) quit 
  9. The above lines do the following: set a GUID partition table (GPT) label, check it, make a primary partition, turn on the "Linux RAID" flag, check the alignment of partition 1, and do a final check of the partition. Another check you can do:

    Code: Select all

    sudo gdisk -l /dev/sda 
  10. Now repeat the above for the other disk (sdb for me).
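    If you'd rather script it than retype the interactive session, parted's -s (script) mode can run the same commands; a sketch, assuming your two disks really are sda and sdb (check with lsblk first):

    Code: Select all

    for d in sda sdb; do
        sudo parted -s /dev/$d mklabel gpt mkpart primary 0% 100%  # GPT label + one full-disk partition
        sudo parted -s /dev/$d set 1 raid on                       # flag partition 1 as Linux RAID
        sudo parted -s /dev/$d align-check optimal 1               # confirm alignment
    done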
  11. Create raid array (don't forget the 1's, i.e. "sda1" not "sda", if you are using partitions):

    Code: Select all

    sudo mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  12. You can check the syncing progress (which takes a few hours) with these:

    Code: Select all

    cat /proc/mdstat
    sudo mdadm --detail /dev/md0
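    If you'd rather not keep rerunning that, watch (from the procps package) will refresh the status every two seconds:

    Code: Select all

    watch cat /proc/mdstat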
  13. While you may not have to wait for it to finish, I do just to be safe.
  14. The following steps are for making the filesystem and mounting it. You could create logical volumes before creating the filesystem and mounting, but you don't have to. Guide

    Code: Select all

    sudo mkfs.ext4 -F /dev/md0
    sudo mkdir /data
    sudo chown User /data
    sudo chmod 775 /data
    sudo mount /dev/md0 /data
    
  15. You don't have to do the chown step (replace User with your username), but it makes using the /data folder easier. Now try writing something to the RAID array, e.g. "touch test" or something like that.
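    For example, a quick sanity check that the mount is live and writable:

    Code: Select all

    touch /data/test
    df -h /data     # should show /dev/md0 mounted on /data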
  16. Create the mdadm.conf file, which is used at boot to assemble the array (do this as root: with plain sudo, the > redirection would still run as your unprivileged user and fail):

    Code: Select all

    # mdadm --detail --scan > /etc/mdadm.conf
  17. Note, I used the --detail and not the --examine option. Here's a link for the differences. Also note the ">" vs ">>". You probably want the former because it will overwrite a bad .conf file instead of append, but if you have multiple arrays, you may want to append.
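    For reference, the file should end up with one ARRAY line per array, something like this (the name and UUID will differ on your system):

    Code: Select all

    ARRAY /dev/md0 metadata=1.2 name=myhost:0 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx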
  18. Edit fstab so the mount is persistent across reboots:

    Code: Select all

    # vi /etc/fstab
    /dev/md0	/data		ext4	defaults	0    0
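    Two optional hardenings for that line (my suggestions, not something from the guides above): use the filesystem UUID instead of /dev/md0, and add the nofail mount option so a missing or degraded array skips the mount instead of dropping boot into emergency mode. Get the UUID with blkid:

    Code: Select all

    # blkid /dev/md0
    UUID=<uuid-from-blkid>	/data		ext4	defaults,nofail	0    0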
  19. IMPORTANT part. You need to rebuild the initial ramdisk image (initramfs); it carries a copy of mdadm.conf, so a stale image means the array isn't assembled at boot. Guide 1 with recovery info, Guide 2. Create a backup first, then rebuild (all as root):

    Code: Select all

    # cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
    # ll /boot/initramfs-$(uname -r).img*
    # dracut -f
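
    To verify the rebuilt image actually picked up your config, lsinitrd (part of the dracut package) lists the files inside it:

    Code: Select all

    # lsinitrd /boot/initramfs-$(uname -r).img | grep mdadm.conf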
    
  20. Moment of truth: Reboot. If it works, YAY! :D If not, :cry: ...try again?
I think I had two main problems when I was trying to make the raid array (the first 4 times I tried...): 1. I had leftover metadata/filesystems from previous failures, 2. I didn't do the initramfs rebuild step because it's not in the main online guides. :? The first reference to it I found was buried in an Ubuntu forum, but Debian uses a different command for doing it.
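
For reference, the Debian/Ubuntu equivalent of the dracut rebuild is:

Code: Select all

    sudo update-initramfs -u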

I hope this saves some people some headache.

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria

Re: [Tutorial] Software RAID fails on reboot and leads to emergency mode

Post by hunter86_bg » 2017/12/03 16:30:59

I can also share my experience with RAID1.
On day one, I was really surprised to hear that they use software RAID in order to spare the storage system from replicating thousands of LUNs, but the arrays work without any issues.

The only thing I noticed in your guide that needs fixing is the use of non-persistent naming (like /dev/sda) instead of persistent naming (everything under /dev/disk/by-* is considered persistent).
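
For example, to see the persistent names the kernel has assigned:

Code: Select all

    ls -l /dev/disk/by-id/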
