"Failed to start Crash recovery kernel arming" after installing NVIDIA CUDA and cuDNN

Post by **hopefulp** » 2019/07/30 09:45:20

I am using CentOS 7, and there are three boot entries in GRUB (3.10...957.., 3.10. 8..., and rescue mode).
I have just installed CUDA and cuDNN, and while doing so I created a conda virtual environment with a different Python version.
A CUDA sample program then failed with an error, and the message advised a reboot.
Without removing the anaconda activation from my .bashrc file, I rebooted and ignored some updates that were pending.
After the reboot it went to the graphical interface, and after login the display was unstable, with the X windows changing frantically.
I logged out and logged in as superuser, but X was still in chaos.
Over a remote connection, I commented out the anaconda activation part in the user's home .bashrc and rebooted.
Then it showed "Failed to start Crash recovery kernel arming" with all three boot options.
What should I do now (other than reinstalling the OS)?

Over the remote connection, sudo systemctl status kdump.service shows the following:
● kdump.service - Crash recovery kernel arming
Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2019-07-30 18:34:10 KST; 9min ago
Process: 4273 ExecStart=/usr/bin/kdumpctl start (code=exited, status=1/FAILURE)
Main PID: 4273 (code=exited, status=1/FAILURE)

Jul 30 18:34:10 chi systemd[1]: Starting Crash recovery kernel arming...
Jul 30 18:34:10 chi kdumpctl[4273]: No memory reserved for crash kernel
Jul 30 18:34:10 chi kdumpctl[4273]: Starting kdump: [FAILED]
Jul 30 18:34:10 chi systemd[1]: kdump.service: main process exited, code=exited, status=1/FAILURE
Jul 30 18:34:10 chi systemd[1]: Failed to start Crash recovery kernel arming.
Jul 30 18:34:10 chi systemd[1]: Unit kdump.service entered failed state.
Jul 30 18:34:10 chi systemd[1]: kdump.service failed.
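
A note on the "No memory reserved for crash kernel" line in the output above: kdumpctl usually prints this when the running kernel was booted without a crashkernel= parameter, or when the reservation did not succeed. Here is a minimal sketch for checking the first case, assuming a POSIX shell. The helper name has_crashkernel is made up for illustration and is not part of kexec-tools.

```shell
# Sketch: report whether a kernel command line asks for a kdump reservation.
# has_crashkernel is a hypothetical helper, not a kexec-tools command.
has_crashkernel() {
    # $1: a kernel command line string, e.g. the contents of /proc/cmdline.
    # Padding with spaces lets the pattern match the parameter at any position.
    case " $1 " in
        *" crashkernel="*) echo "reserved" ;;
        *)                 echo "missing" ;;
    esac
}

# Usage on a live system:
#   has_crashkernel "$(cat /proc/cmdline)"
```

The parameter being present does not guarantee that memory was actually set aside. On a live system, /sys/kernel/kexec_crash_size reports the reserved size in bytes, and a value of 0 would be consistent with the message above.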

Post by **TrevorH** » 2019/07/30 10:14:38

Do you have the technical ability to analyze a kdump? If not then I would just disable the kdump service.

Post by **lightman47** » 2019/07/30 11:55:21

For what it's worth - I installed cuda, which installed cuda-libs, this past Sunday. My machine arrived at the same point thereafter, and un-installing the two was of no help. What got me going again (2-3 hours later) was yum -y reinstall \*.\*. I suspect my x11 stuff may have gotten hosed. I had tried to re-install kmod-nvidia to no avail.

Post by **hopefulp** » 2019/07/30 12:02:47

TrevorH wrote: ↑
2019/07/30 10:14:38
Do you have the technical ability to analyze a kdump? If not then I would just disable the kdump service.

I don't have that ability, so I have disabled kdump:
# systemctl disable kdump.service
# reboot

But the boot now seems to get stuck at that point.
The last few messages are:
14.... IPv6: ADDRCONF(NETDEV_UP): virbr0: link is not ready
14...... virbr0: port 1(virbr0-nic) entered disabled state

Post by **TrevorH** » 2019/07/30 12:26:50

Are you using an nvidia card for your graphics and trying to boot in GUI mode?

Post by **hopefulp** » 2019/07/30 12:45:36

#lspci
NVIDIA .... GeForce GTX 1060 3GB

#systemctl get-default
graphical.target

I removed the latest kernel and reinstalled it, but nothing has changed.

Post by **TrevorH** » 2019/07/30 12:55:50

Time to start reading logs then. Start with /var/log/Xorg.*.log and /var/log/messages and see if they shed any light.
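
As a starting point for reading those logs: Xorg tags error lines with the marker "(EE)", so filtering on it cuts a long log down to the failures. This is a minimal sketch only. xorg_errors is a hypothetical helper name, and Xorg.0.log in the usage line is just an example file name.

```shell
# Sketch: print only the error lines from an Xorg log read on stdin.
# xorg_errors is a hypothetical helper name used for illustration.
xorg_errors() {
    # -F treats "(EE)" as a fixed string rather than a regular expression.
    grep -F '(EE)'
}

# Usage (example file name):
#   xorg_errors < /var/log/Xorg.0.log
```

The legend at the top of each Xorg log also contains "(EE)", so one matching line near the start is expected even in a clean log.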

Post by **hopefulp** » 2019/07/30 13:18:01

I also tested "systemctl set-default multi-user.target" and rebooted; it goes to the text login without any problem. So maybe that means there is no problem with the kernel.

Reading /var/log:
There is one old log file, Xorg.9.log (Jul 4), that shows a successful start, but all the other files appear to show failures.
In Xorg.9.log: the (boot serial number?) shows 5, 6, then goes to 70, and the server terminated successfully (no messages about NVIDIA).
In the other Xorg.n.log files: many lines at 12... related to NVIDIA. Some excerpts:
NVIDIA: Failed to initialize the NVIDIA kernel module, see the system's kernel log
no devices detected
Fatal server error:
no screens found
server terminated with error
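
The "Failed to initialize the NVIDIA kernel module" line suggests the X driver could not reach the kernel side of the driver. One quick check is whether a module named nvidia is loaded at all. Here is a minimal sketch that reads lsmod-style output; nvidia_module_loaded is a hypothetical helper name.

```shell
# Sketch: succeed if lsmod-style output on stdin lists a module named "nvidia".
# nvidia_module_loaded is a hypothetical helper name used for illustration.
nvidia_module_loaded() {
    # lsmod prints the module name in the first column; match it exactly so
    # that related modules with longer names are not counted.
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Usage on a live system:
#   lsmod | nvidia_module_loaded && echo "loaded" || echo "not loaded"
```

If the module is not loaded, the kernel log (for example dmesg or /var/log/messages) is where the reason would normally show up.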

Shall I reinstall the NVIDIA driver?

Post by **TrevorH** » 2019/07/30 13:35:00

So, before you installed the cuda stuff, did you have graphics working? And was that already using the nvidia proprietary drivers or was it using the distro supplied 'nouveau' driver?

Post by **hopefulp** » 2019/07/30 13:52:20

I was using the NVIDIA driver. The driver version could be seen in NVIDIA-setting.
Now I have reinstalled the NVIDIA driver with a slightly different version: previously 430.26 (I can't remember exactly), now 430.34.
When I first installed the NVIDIA driver, there was a complicated procedure: blacklisting nouveau, etc.
This time I didn't do any of that. I just installed NVIDIA-setting, which uninstalled the previous version.
And now it is working.
Amazingly, the CUDA/cuDNN example package mnistCUDNN is working too.
It looks like it is solved. I have learned to read /var/log/Xorg... I appreciate it.

Previously, NVIDIA was working, and installing CUDA caused the problem. CUDA did not work right after installation, and there was advice to reboot. The reboot should be done carefully: comment out all the anaconda-related activation first. On reboot it went to the graphical login and then into chaos, which meant that the NVIDIA module was broken. That is my guess.
