SATA problems AMD + Samsung SSD

Posted: 2019/06/12 09:32:35
by altis0
I'm having intermittent problems with the SATA interface to the SSD.

OS : CentOS 7 with latest yum updates.
Motherboard : Asus M5A78L-M/USB3
CPU : AMD Opteron 3350HE
RAM : 4 x 8GB PC3L-12800
Southbridge : AMD SB710
SSD : 256 GB Samsung SSD 860 Pro

The errors appear to start in the host interface (SErr 0x800):

Jun 11 14:38:01 lintel kernel: ata1.00: ATA-11: Samsung SSD 860 PRO 256GB, RVM01B6Q, max UDMA/133
Jun 11 14:38:01 lintel kernel: ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
Jun 11 14:38:01 lintel kernel: ata1.00: supports DRM functions and may not be fully accessible
Jun 11 14:38:01 lintel kernel: ata1.00: configured for UDMA/133
Jun 11 14:38:02 lintel kernel: ata1.00: Enabling discard_zeroes_data
Jun 11 14:38:02 lintel kernel: ata1.00: Enabling discard_zeroes_data
Jun 11 14:38:02 lintel kernel: ata1.00: Enabling discard_zeroes_data
<snip>
Jun 12 07:34:46 lintel kernel: ata1.00: exception Emask 0x50 SAct 0x1c000000 SErr 0x800 action 0x6 frozen
Jun 12 07:34:46 lintel kernel: ata1.00: irq_stat 0x08000000, interface fatal error
Jun 12 07:34:46 lintel kernel: ata1: SError: { HostInt }
Jun 12 07:34:46 lintel kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jun 12 07:34:46 lintel kernel: ata1.00: cmd 61/20:d0:00:5a:96/00:00:17:00:00/40 tag 26 ncq 16384 out
         res 40/00:d0:00:5a:96/00:00:17:00:00/40 Emask 0x50 (ATA bus error)
Jun 12 07:34:46 lintel kernel: ata1.00: status: { DRDY }
Jun 12 07:34:46 lintel kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jun 12 07:34:46 lintel kernel: ata1.00: cmd 61/20:d8:40:2c:b6/00:00:1a:00:00/40 tag 27 ncq 16384 out
         res 40/00:d0:00:5a:96/00:00:17:00:00/40 Emask 0x50 (ATA bus error)
Jun 12 07:34:46 lintel kernel: ata1.00: status: { DRDY }
Jun 12 07:34:46 lintel kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jun 12 07:34:46 lintel kernel: ata1.00: cmd 61/02:e0:4f:99:af/00:00:1a:00:00/40 tag 28 ncq 1024 out
         res 40/00:d0:00:5a:96/00:00:17:00:00/40 Emask 0x50 (ATA bus error)
Jun 12 07:34:46 lintel kernel: ata1.00: status: { DRDY }
Jun 12 07:34:46 lintel kernel: ata1: hard resetting link
Jun 12 07:34:46 lintel kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 12 07:34:46 lintel kernel: ata1.00: supports DRM functions and may not be fully accessible
Jun 12 07:34:46 lintel kernel: ata1.00: supports DRM functions and may not be fully accessible
Jun 12 07:34:46 lintel kernel: ata1.00: configured for UDMA/133
Jun 12 07:34:46 lintel kernel: ata1: EH complete
Jun 12 07:34:46 lintel kernel: ata1.00: Enabling discard_zeroes_data
In this example there are only 3 queued commands. Commonly there are 31, the maximum queue depth libata uses for this drive (the boot log reports NCQ depth 31/32).
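As a cross-check of that count: the `SAct` value in the exception line is a bitmask with one bit per outstanding NCQ tag, and each `cmd` field can be decoded into its tag and transfer size. The Python sketch below is a minimal illustration under stated assumptions. It assumes libata's `cmd` layout (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device). It also assumes that FPDMA commands carry the tag in bits 7:3 of nsect and the sector count in the feature registers, and that the drive uses 512-byte logical sectors. `active_tags` and `decode_fpdma` are hypothetical helper names.

```python
def active_tags(sact: int) -> list:
    """Return the NCQ tags (0-31) whose bits are set in an SAct value."""
    return [tag for tag in range(32) if sact & (1 << tag)]


def decode_fpdma(cmd: str) -> dict:
    """Decode a libata 'cmd' field for a READ/WRITE FPDMA QUEUED command.

    Assumed layout: command/feature:nsect:lbal:lbam:lbah/
                    hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    command, low, high, _device = cmd.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    sectors = (hob_feature << 8) | feature  # FPDMA: sector count in feature regs
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {
        "command": int(command, 16),  # 0x61 = WRITE FPDMA QUEUED
        "tag": nsect >> 3,            # NCQ tag in nsect bits 7:3
        "sectors": sectors,
        "bytes": sectors * 512,       # assumes 512-byte logical sectors
        "lba": lba,
    }


if __name__ == "__main__":
    print(active_tags(0x1C000000))  # [26, 27, 28]
    for cmd in ("61/20:d0:00:5a:96/00:00:17:00:00/40",
                "61/20:d8:40:2c:b6/00:00:1a:00:00/40",
                "61/02:e0:4f:99:af/00:00:1a:00:00/40"):
        info = decode_fpdma(cmd)
        print(info["tag"], info["bytes"])  # 26 16384, then 27 16384, then 28 1024
```

The three tags (26, 27, 28) and sizes agree with the `tag` and `ncq` values libata printed for each failed command.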

The BIOS is configured to use IDE mode rather than AHCI, but it seems we still end up with command queueing anyway.

An unresolved incompatibility between AMD's AHCI implementation and Samsung SSDs is widely reported online. Sadly, I only found these reports after building and installing the system.

Re: SATA problems AMD + Samsung SSD

Posted: 2019/06/12 10:38:41
by TrevorH
altis0 wrote:
The BIOS is configured to use IDE mode rather than AHCI, but it seems we still end up with command queueing anyway.
Why? SSDs basically require AHCI to work properly.

If you've not already done so, change the SATA cable.

Re: SATA problems AMD + Samsung SSD

Posted: 2019/06/12 10:47:56
by altis0
I've tried several SATA cables and even a different PSU. Eventually, I always see the same problems.

The same system has two Seagate Barracuda HDDs set up as a RAID array, and I never see a problem with those.

Note the error code 0x800. This means:

Internal Error (E):
The host bus adapter experienced an internal error that caused the operation to fail and may have put the host bus adapter into an error state. The internal error may include a master or target abort when attempting to access system memory, an elasticity buffer overflow, a primitive mis-alignment, a synchronization FIFO overflow, and other internal error conditions. Typically when an internal error occurs, a non-fatal or fatal status bit in the PxIS register will also be set to give software guidance on the recovery mechanism required.


From:
https://www.intel.com/content/dam/www/p ... rev1_3.pdf
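The value 0x800 is bit 11 of the SError register, the E (internal error) bit quoted above, which is why the kernel logged `SError: { HostInt }`. Below is a minimal Python sketch of that decoding. It assumes the bit-to-label table that libata uses when it prints SError (ERR field in the low 16 bits, DIAG field in the high 16 bits), so check the table against your kernel's `drivers/ata/libata-eh.c` before relying on it. `decode_serror` and `format_serror` are hypothetical helper names.

```python
# SError bit positions and the short labels libata prints for them (assumed table).
SERROR_BITS = {
    0: "RecovData",    # ERR.I: recovered data integrity error
    1: "RecovComm",    # ERR.M: recovered communications error
    8: "UnrecovData",  # ERR.T: transient data integrity error
    9: "Persist",      # ERR.C: persistent communication or data integrity error
    10: "Proto",       # ERR.P: protocol error
    11: "HostInt",     # ERR.E: internal error (the case quoted above)
    16: "PHYRdyChg",   # DIAG.N: PhyRdy change
    17: "PHYInt",      # DIAG.I: PHY internal error
    18: "CommWake",    # DIAG.W: COMWAKE detected
    19: "10B8B",       # DIAG.B: 10b to 8b decode error
    20: "Dispar",      # DIAG.D: disparity error
    21: "BadCRC",      # DIAG.C: CRC error
    22: "Handshk",     # DIAG.H: handshake error
    23: "LinkSeq",     # DIAG.S: link sequence error
    24: "TrStaTrns",   # DIAG.T: transport state transition error
    25: "UnrecFIS",    # DIAG.F: unknown FIS type
    26: "DevExch",     # DIAG.X: device exchanged
}


def decode_serror(value: int) -> list:
    """Return the labels for every known bit set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


def format_serror(value: int) -> str:
    """Mimic the 'SError: { ... }' line libata writes to the kernel log."""
    return "SError: { " + "".join(name + " " for name in decode_serror(value)) + "}"


if __name__ == "__main__":
    print(format_serror(0x800))  # SError: { HostInt }
```

Link-level problems such as bad cables would more typically show up as DIAG bits (for example `BadCRC` or `10B8B`) than as the single ERR.E bit seen here.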

Re: SATA problems AMD + Samsung SSD

Posted: 2019/06/12 11:33:27
by stevemowbray
We had similar problems with some Integral SSDs. We used the following kernel parameter to disable NCQ:
libata.force=noncq

I don't know whether that will work for your particular mix of hardware but it might be worth a try.

Re: SATA problems AMD + Samsung SSD

Posted: 2019/06/12 11:58:44
by altis0
Thanks, that sounds promising.

Bit of a CentOS novice here. Where do I put that parameter?

Re: SATA problems AMD + Samsung SSD

Posted: 2019/06/12 12:04:49
by stevemowbray
You can update your kernel command line with grubby:

grubby --update-kernel=ALL --args="libata.force=noncq"
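A note on scope: according to the kernel's parameter documentation, `libata.force` takes a comma-separated list of `[ID:]VAL` entries. Here ID is `PORT[.DEVICE]` and matches the `ataN.DD` names in the log. So `libata.force=1.00:noncq` would limit the change to the SSD on ata1, whereas a bare `noncq` applies to every libata device. The Python sketch below lists the entries found on a kernel command line such as `/proc/cmdline` and the ID each one targets. `parse_libata_force` is a hypothetical helper, not part of grubby or the kernel. It follows the documented rule that an entry without an ID reuses the previous ID and, before any ID is given, applies to all devices.

```python
def parse_libata_force(cmdline: str) -> list:
    """Return (target, value) pairs for each libata.force= option on a cmdline.

    Entries use the form [ID:]VAL with ID = PORT[.DEVICE]. An entry without an
    ID reuses the previous ID; before any ID is seen it targets "all" devices.
    Each libata.force= token is parsed independently in this sketch.
    """
    pairs = []
    for token in cmdline.split():
        if not token.startswith("libata.force="):
            continue
        last_id = "all"
        for entry in token[len("libata.force="):].split(","):
            if ":" in entry:
                last_id, value = entry.split(":", 1)
            else:
                value = entry
            pairs.append((last_id, value))
    return pairs


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(parse_libata_force(f.read()))
    except FileNotFoundError:
        print("no /proc/cmdline (not running on Linux?)")
```

After the grubby change and a reboot, the bare form used in this thread would be reported as `[('all', 'noncq')]`.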

Re: SATA problems AMD + Samsung SSD

Posted: 2019/06/12 12:57:59
by altis0
Thanks. I seem to have typed the right thing:

Jun 12 13:50:37 lintel kernel: ata1: SATA max UDMA/133 abar m1024@0xf9dffc00 port 0xf9dffd00 irq 22
Jun 12 13:50:37 lintel kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 12 13:50:37 lintel kernel: ata1.00: FORCE: horkage modified (noncq)
Jun 12 13:50:37 lintel kernel: ata1.00: supports DRM functions and may not be fully accessible
Jun 12 13:50:37 lintel kernel: ata1.00: ATA-11: Samsung SSD 860 PRO 256GB, RVM01B6Q, max UDMA/133
Jun 12 13:50:37 lintel kernel: ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (not used)
Jun 12 13:50:37 lintel kernel: ata1.00: supports DRM functions and may not be fully accessible
Jun 12 13:50:37 lintel kernel: ata1.00: configured for UDMA/133
Jun 12 13:50:38 lintel kernel: ata1.00: Enabling discard_zeroes_data
Jun 12 13:50:38 lintel kernel: ata1.00: Enabling discard_zeroes_data
Jun 12 13:50:38 lintel kernel: ata1.00: Enabling discard_zeroes_data
Now I just have to wait and see if I get any more errors.
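While waiting, one way to watch for a recurrence is to count libata exception reports per device in the system log. On CentOS 7 that log is `/var/log/messages`, or `journalctl -k` output saved to a file. Below is a minimal Python sketch. Its regular expression is modelled only on the exception line quoted at the top of this thread, and the script name in the usage comment is hypothetical.

```python
import re
import sys
from collections import Counter

# Modelled on: "ata1.00: exception Emask 0x50 SAct 0x1c000000 SErr 0x800 action 0x6 frozen"
EXCEPTION_RE = re.compile(
    r"\b(ata\d+(?:\.\d+)?): exception Emask (0x[0-9a-f]+) "
    r"SAct (0x[0-9a-f]+) SErr (0x[0-9a-f]+)"
)


def count_exceptions(lines) -> Counter:
    """Count libata exception reports per port or device (e.g. 'ata1.00')."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 count_ata_errors.py /var/log/messages
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log:
            for device, n in sorted(count_exceptions(log).items()):
                print(f"{device}: {n} exception(s)")
```

An empty result over a period that previously produced errors is the signal that the workaround is holding.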

Re: SATA problems AMD + Samsung SSD

Posted: 2019/06/14 18:38:05
by altis0
Over 48 hours in, there have been no more errors, so it looks like that one's fixed. Many thanks for your help.

Jun 12 13:50:37 lintel kernel: ata1: SATA max UDMA/133 abar m1024@0xf9dffc00 port 0xf9dffd00 irq 22
Jun 12 13:50:37 lintel kernel: ata2: SATA max UDMA/133 abar m1024@0xf9dffc00 port 0xf9dffd80 irq 22
Jun 12 13:50:37 lintel kernel: ata3: SATA max UDMA/133 abar m1024@0xf9dffc00 port 0xf9dffe00 irq 22
Jun 12 13:50:37 lintel kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 12 13:50:37 lintel kernel: ata1.00: FORCE: horkage modified (noncq)
Jun 12 13:50:37 lintel kernel: ata1.00: supports DRM functions and may not be fully accessible
Jun 12 13:50:37 lintel kernel: ata1.00: ATA-11: Samsung SSD 860 PRO 256GB, RVM01B6Q, max UDMA/133
Jun 12 13:50:37 lintel kernel: ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (not used)
Jun 12 13:50:37 lintel kernel: ata1.00: supports DRM functions and may not be fully accessible
Jun 12 13:50:37 lintel kernel: ata1.00: configured for UDMA/133
Jun 12 13:50:37 lintel kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 12 13:50:37 lintel kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 12 13:50:37 lintel kernel: ata2.00: FORCE: horkage modified (noncq)
Jun 12 13:50:37 lintel kernel: ata2.00: ATA-10: ST8000DM005-2EH112, DN03, max UDMA/133
Jun 12 13:50:37 lintel kernel: ata2.00: 15628053168 sectors, multi 16: LBA48 NCQ (not used)
Jun 12 13:50:37 lintel kernel: ata2.00: configured for UDMA/133
Jun 12 13:50:37 lintel kernel: ata3.00: FORCE: horkage modified (noncq)
Jun 12 13:50:37 lintel kernel: ata3.00: ATA-10: ST8000DM005-2EH112, DN03, max UDMA/133
Jun 12 13:50:37 lintel kernel: ata3.00: 15628053168 sectors, multi 16: LBA48 NCQ (not used)
Jun 12 13:50:37 lintel kernel: ata3.00: configured for UDMA/133
Jun 12 13:50:38 lintel kernel: ata1.00: Enabling discard_zeroes_data
Jun 12 13:50:38 lintel kernel: ata1.00: Enabling discard_zeroes_data
Jun 12 13:50:38 lintel kernel: ata1.00: Enabling discard_zeroes_data
Jun 12 13:50:43 lintel kernel: ata2.00: configured for UDMA/133
Jun 12 13:50:43 lintel kernel: ata2: EH complete
Jun 12 13:50:43 lintel kernel: ata3.00: configured for UDMA/133
Jun 12 13:50:43 lintel kernel: ata3: EH complete
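The boot log above also shows `FORCE: horkage modified (noncq)` and `NCQ (not used)` for ata2.00 and ata3.00. So the unscoped parameter turned NCQ off for the two Seagate HDDs as well as the SSD. Below is a minimal Python sketch that extracts each device's reported NCQ state from such messages. `ncq_status` is a hypothetical helper, and its pattern is modelled only on the `sectors, ... NCQ (...)` lines seen in this thread.

```python
import re
import sys

# Modelled on lines such as:
#   "ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA"
#   "ata2.00: 15628053168 sectors, multi 16: LBA48 NCQ (not used)"
NCQ_RE = re.compile(r"\b(ata\d+\.\d+): \d+ sectors, .*\bNCQ \(([^)]*)\)")


def ncq_status(lines) -> dict:
    """Map each ATA device to its reported NCQ state, e.g. 'not used' or 'depth 31/32'."""
    status = {}
    for line in lines:
        match = NCQ_RE.search(line)
        if match:
            status[match.group(1)] = match.group(2)  # later boots overwrite earlier ones
    return status


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 ncq_status.py /var/log/messages
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log:
            for device, state in sorted(ncq_status(log).items()):
                print(device, state)
```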

Re: SATA problems AMD + Samsung SSD

Posted: 2019/10/12 19:00:10
by zerofire
That Opteron should never have been placed in an AM3 board since it is an AM3+ processor. Also there are a few known issues with the SB7x0 chips.

Linux platform:
HPET operation with MSI causes LPC DMA corruption on devices using LPC DMA (floppy, parallel port, serial port in FIR mode) because MSI requests are misinterpreted as DMA cycles by the broken LPC controller
USB freeze when multiple devices are connected through hub (related to AMD Product Advisory PA_SB700AK1)
Erratic behavior of the HPET when Spread Spectrum is enabled (related to AMD Product Advisory PA_SB700AG2)
Disabling legacy interrupts for SATA disables MSI too
SATA soft reset fails when PMP is enabled and attached devices will not be detected
SATA internal errors are ignored because the controller will set Serial ATA port Error when it should not

That last one seems like it might apply to you. For an SSD, the SATA controller really needs to be in AHCI or RAID mode.
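One way to see which mode the controller is actually presenting is its PCI class code: 0x0101xx is IDE, 0x0104xx is RAID, and 0x0106xx is SATA, with programming interface 0x01 meaning AHCI 1.0. The minimal Python sketch below reads the `class` attribute Linux exposes under `/sys/bus/pci/devices/*/`. The helper names are hypothetical. The kernel can override the reported class for some controllers, so treat the result as what Linux sees, which may differ from the BIOS setting.

```python
import os

# Subclass codes under PCI base class 0x01 (mass storage) relevant here.
SUBCLASS_NAMES = {
    0x01: "IDE",
    0x04: "RAID",
    0x06: "SATA",
}


def classify(class_code: int) -> str:
    """Describe a 24-bit PCI class code such as 0x010601 (base/subclass/prog-if)."""
    base = class_code >> 16
    subclass = (class_code >> 8) & 0xFF
    prog_if = class_code & 0xFF
    if base != 0x01:
        return "not a storage controller"
    name = SUBCLASS_NAMES.get(subclass, "other storage (subclass 0x%02x)" % subclass)
    if subclass == 0x06 and prog_if == 0x01:
        name += " (AHCI)"  # prog-if 0x01 under SATA denotes AHCI 1.0
    return name


def storage_controllers(root: str = "/sys/bus/pci/devices") -> dict:
    """Map PCI addresses under `root` to a description of each storage controller."""
    result = {}
    if not os.path.isdir(root):
        return result
    for addr in sorted(os.listdir(root)):
        try:
            with open(os.path.join(root, addr, "class")) as f:
                code = int(f.read().strip(), 16)  # sysfs stores e.g. "0x010601"
        except (OSError, ValueError):
            continue
        if code >> 16 == 0x01:
            result[addr] = classify(code)
    return result


if __name__ == "__main__":
    for addr, kind in storage_controllers().items():
        print(addr, kind)
```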