ECC error (mcelog)

Posted: 2019/02/06 15:48:02
by CANnix
Hi all,

I am running a CentOS 6 server on a Xeon X5650 machine (SMP). Under high CPU load the server regularly crashes after some time. In the mcelog output I can trace the error back to data transfer between RAM and MCU:
...
STATUS 88010282 MCGSTATUS 0
MCGCAP 1c09 APICID 35 SOCKETID 1
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 20 BANK 6 TSC b7065eeaa1810
TIME 1545643603 Mon Dec 24 10:26:43 2018
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-2 Generic Error
STATUS b200000080000106 MCGSTATUS 4
MCGCAP 1c09 APICID 13 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 6 TSC b7065eeaa18b0
TIME 1545643603 Mon Dec 24 10:26:43 2018
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-2 Generic Error
STATUS b200000080000106 MCGSTATUS 4
MCGCAP 1c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 8
MISC 5222508000086200
TIME 1547586533 Tue Jan 15 22:08:53 2019
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCA: MEMORY CONTROLLER MS_CHANNELunspecified_ERR
Transaction: Memory scrubbing error
Memory ECC error occurred during scrub
Memory corrected error count (CORE_ERR_CNT): 1
Memory transaction Tracker ID (RTId): 0
Memory DIMM ID of error: 0
Memory channel ID of error: 2
Memory ECC syndrome: 52225080
STATUS 88000040000200cf MCGSTATUS 0
MCGCAP 1c09 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
...
Now the thing is that Memtest runs fine, except for the SMP version, which crashes. AFAIK this is not too uncommon and does not necessarily indicate a hardware fault. To rule out defective RAM as the root cause, the respective DIMM was replaced with a brand new one before the latest crash (the one documented in the log above).

Does anyone see a solution or a direction for further analysis?

Best regards
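
For anyone who wants to cross-check the decoded text against the raw STATUS values in the log above, here is a minimal sketch. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 down to 57, MCA error code in bits 15:0, model-specific code in bits 31:16). The names FLAG_BITS and decode_status are made up for this example and are not part of mcelog.

```python
# Sketch: decode the architectural flag bits of an IA32_MCi_STATUS value.
# Bit positions are taken from the Intel SDM register layout; FLAG_BITS and
# decode_status are hypothetical helper names, not an mcelog interface.

FLAG_BITS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status):
    """Split a 64-bit status value into flag names and error codes."""
    flags = [
        name
        for bit, name in sorted(FLAG_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_code": status & 0xFFFF,                    # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # The two full STATUS values that appear in the log excerpt.
    for raw in ("b200000080000106", "88000040000200cf"):
        decoded = decode_status(int(raw, 16))
        print(raw, decoded["flags"], hex(decoded["mca_code"]))
```

With this layout, b200000080000106 comes out as VAL, UC, EN, PCC, which agrees with the "Uncorrected error / Error enabled / Processor context corrupt" lines, and 88000040000200cf comes out as VAL, MISCV, which agrees with "Corrected error / MCi_MISC register valid".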

Re: ECC error (mcelog)

Posted: 2019/02/06 16:20:27
by TrevorH
Your latest one is on a different cpu and memory bank than the first two. The first two are cpu 20 (and 4), bank 6; the latest one is cpu 1, bank 8. If your machines are anything like mine then even numbered cpus are all on one socket and odd numbered ones are on the other socket. I think it likely that bank 6 on cpu 4 and 20 are the same memory module while bank 8 on cpu 1 is likely to be a different one.

If you have libvirt installed on there then running virsh capabilities is one way to see which cpu numbers belong to which socket. If you don't have libvirt then try lscpu -a --extended (part of the util-linux-ng package) instead and look for lines like

NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
Edit: updated yet again after I realised that this is CentOS 6, where lscpu is not installed by default, and then again because that version is old and doesn't know about lscpu -y
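
To turn lines like the two above into a lookup from cpu number to NUMA node, something along these lines should do. It is only a sketch: it assumes the "NUMA nodeN CPU(s):" line format shown above, and numa_map is a made-up helper name.

```python
# Sketch: build a {cpu: numa_node} lookup from lscpu-style text.
# Assumes lines of the form "NUMA node0 CPU(s):   0,2,4" (ranges such as
# "0-5" are accepted too). numa_map is a hypothetical helper name.
import re

SAMPLE = """\
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
"""


def numa_map(lscpu_text):
    mapping = {}
    for line in lscpu_text.splitlines():
        match = re.match(r"\s*NUMA node(\d+) CPU\(s\):\s*(\S+)", line)
        if not match:
            continue  # not a NUMA line
        node = int(match.group(1))
        for part in match.group(2).split(","):
            if "-" in part:
                first, last = part.split("-")
                cpus = range(int(first), int(last) + 1)
            else:
                cpus = [int(part)]
            for cpu in cpus:
                mapping[cpu] = node
    return mapping


if __name__ == "__main__":
    nodes = numa_map(SAMPLE)
    # The cpu numbers that show up in the mcelog excerpt of the first post.
    for cpu in (20, 4, 1):
        print("cpu", cpu, "-> node", nodes[cpu])
```

On a live system the same text could be fed in from the lscpu output instead of SAMPLE.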

Re: ECC error (mcelog)

Posted: 2019/02/06 16:26:40
by CANnix
Thanks for the reply! Yes, I get the same output from lscpu as you posted. Am I right that defective RAM does not seem to be the problem here?

Re: ECC error (mcelog)

Posted: 2019/02/06 16:35:35
by TrevorH
Depends. Did you replace all your RAM or just the defective DIMM in node 0 bank 6? Because the latest MCE is on bank 8 on the other node. Sounds like a different DIMM to me.
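
To see at a glance which cpu, bank and socket each event was reported on, the fields can be pulled out of the mcelog text. This is a rough sketch that assumes the record layout seen in the excerpt from the first post: a record starts at the "Hardware event" line and the STATUS / SOCKETID lines that follow it belong to the same record. summarize_events is a made-up name.

```python
# Sketch: list (cpu, bank, socket) per event in mcelog text output.
# Assumption: each record begins with "Hardware event" and its trailing
# "... SOCKETID n" line belongs to it, as the excerpt in the thread suggests.
# summarize_events is a hypothetical helper, not part of mcelog.
import re

SAMPLE = """\
MCGCAP 1c09 APICID 35 SOCKETID 1
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 20 BANK 6 TSC b7065eeaa1810
MCA: Data CACHE Level-2 Generic Error
STATUS b200000080000106 MCGSTATUS 4
MCGCAP 1c09 APICID 13 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 8
MCA: MEMORY CONTROLLER MS_CHANNELunspecified_ERR
STATUS 88000040000200cf MCGSTATUS 0
MCGCAP 1c09 APICID 20 SOCKETID 1
"""


def summarize_events(log_text):
    events = []
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("Hardware event"):
            current = {}
            events.append(current)
            continue
        if current is None:
            continue  # tail of a record whose start was cut off
        match = re.match(r"CPU (\d+) BANK (\d+)", line)
        if match:
            current["cpu"] = int(match.group(1))
            current["bank"] = int(match.group(2))
        match = re.search(r"SOCKETID (\d+)", line)
        if match:
            current["socket"] = int(match.group(1))
    # Drop records that never reached their CPU/BANK line.
    return [event for event in events if "cpu" in event]


if __name__ == "__main__":
    for event in summarize_events(SAMPLE):
        print(event)
```

Under that layout assumption, the bank 6 event on cpu 20 in the sample is attributed to socket 0 and the bank 8 event on cpu 1 to socket 1.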

Re: ECC error (mcelog)

Posted: 2019/02/06 17:02:18
by CANnix
I replaced the DIMMs of bank 6, 8 and 10.

Re: ECC error (mcelog)

Posted: 2019/02/06 17:26:53
by TrevorH
As far as I know, banks exist on both sockets, so if you replaced the DIMM in bank 8 on cpu 0 then you still have the error on bank 8 of cpu 1.

Re: ECC error (mcelog)

Posted: 2019/02/06 17:37:56
by CANnix
This is my mainboard: S5520UR (Intel). lshw lists 12 banks with indices from 0 to 11 and the architecture is SMP. And as both CPUs share the same memory address space, the bank names are probably unique.

The log indicates that the replaced DIMMs fail as much as the old ones. Is there a way to detect failures at the memory controller?
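
One thing that might help here, as a sketch only: if an EDAC driver that supports this memory controller is loaded, the kernel exposes per-controller corrected and uncorrected error counters in sysfs. Whether such a driver is available on this CentOS 6 install is an assumption, not something confirmed in this thread. read_edac_counts is a made-up helper name, and the default path is the usual EDAC sysfs location.

```python
# Sketch: read EDAC error counters per memory controller from sysfs.
# Assumption: an EDAC driver for this platform is loaded; otherwise the
# directory is missing or empty and the function returns {}.
# read_edac_counts is a hypothetical helper name.
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for controller in sorted(base.glob("mc[0-9]*")):
        entry = {}
        # ce_count: corrected errors, ue_count: uncorrected errors
        for name in ("ce_count", "ue_count"):
            counter = controller / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[controller.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("no EDAC memory controllers found (driver not loaded?)")
    for controller, entry in result.items():
        print(controller, entry)
```

Comparing the counters before and after a rendering run would show whether corrected errors pile up on one controller only, which would point more towards that socket than towards individual DIMMs.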

Re: ECC error (mcelog)

Posted: 2019/02/11 10:22:08
by CANnix
As we cannot find the root cause of the problem, the server will be taken out of service. However, if anyone still has an idea how to find the problem, I would be glad to hear from you! Kind of a strange issue...

Re: ECC error (mcelog)

Posted: 2019/02/11 11:58:29
by tunk
You say it's happening during high load, so it could be stress or temperature related. Do the fans work? Maybe the PSUs have deteriorated, causing voltage instability under high load. AFAIK the memory controller is built into the CPU - could it be bad contact between CPU and motherboard?

Re: ECC error (mcelog)

Posted: 2019/02/11 13:35:26
by CANnix
It happens while rendering a simulation. However, if I just stress the CPU, it runs fine.