mpt2sas driver behaving strangely with a failed SATA disk behind SAS expander.

Discussion:

Fredrik Lindgren

2011-08-17 14:25:24 UTC

Hello,

I'm seeing something strange on a Supermicro 847E16-R1400. It has SAS
expanders with SATA disks behind them (Seagate Barracuda XT). The SAS card
is an LSI SAS9211-8i.

When doing disk IO on the disks (they are all configured in MD raids),
IO will suddenly stop and these messages are printed on the console about
once every second:

mpt2sas0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)

From what I understand this means:

PL_LOGINFO_CODE_RESET (0x00110000)
PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET (0x00000600)

So a disk is acting up, generating errors? What does the last "10" mean
in the sub_code? Is that an identifier for which disk it is?

After some time, the message changed:

mpt2sas0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)

Now the disk seems to have died completely?

PL_LOGINFO_CODE_RESET (0x00110000)
PL_LOGINFO_SUB_CODE_DSCVRY_SATA_INIT_TIMEOUT (0x00001000)

What bothers me is that the machine is just hanging there with IO blocking
for the disk in question (I guess; this was going on for several hours).
There were no SCSI errors, and the drive in question was not ejected from
the MD array. After rebooting, it started to rebuild the MD array, promptly
got stuck again, and just sat there until the disk was removed from the
array and it was restarted again.

This was with a stock Debian Squeeze kernel (linux-image-2.6.32-5-amd64).
I got the exact same thing with a vanilla 3.0.1 from kernel.org.

Regards,
Fredrik Lindgren

----

dmesg from 3.0.1:

mpt2sas version 08.100.00.02 loaded
mpt2sas 0000:06:00.0: PCI INT A -> GSI 26 (level, low) -> IRQ 26
mpt2sas 0000:06:00.0: setting latency timer to 64
mpt2sas0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (49559612 kB)
mpt2sas 0000:06:00.0: irq 72 for MSI/MSI-X
mpt2sas0: PCI-MSI-X enabled: IRQ 72
mpt2sas0: iomem(0x00000000fbc3c000), mapped(0xffffc90006068000), size(16384)
mpt2sas0: ioport(0x000000000000d000), size(256)
mpt2sas0: sending diag reset !!
mpt2sas0: diag reset: SUCCESS
mpt2sas0: Allocated physical memory: size(3971 kB)
mpt2sas0: Current Controller Queue Depth(1739), Max Controller Queue Depth(2000)
mpt2sas0: Scatter Gather Elements per IO(128)
mpt2sas0: LSISAS2008: FWVersion(09.00.00.00), ChipRevision(0x03), BiosVersion(07.17.00.00)
mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
mpt2sas0: sending port enable !!
mpt2sas0: host_add: handle(0x0001), sas_addr(0x500605b0034da7c0), phys(8)
mpt2sas0: expander_add: handle(0x0009), parent(0x0001), sas_addr(0x5003048001016e7f), phys(38)
mpt2sas0: expander_add: handle(0x0023), parent(0x0002), sas_addr(0x5003048000f6b57f), phys(30)
mpt2sas0: port enable: SUCCESS

***@weathergirl:~# smp_rep_manufacturer /dev/bsg/expander-6\:0
Report manufacturer response:
Expander change count: 85
SAS-1.1 format: 1
vendor identification: LSI CORP
product identification: SAS2X36
product revision level: 0717
component vendor identification: LSI
component id: 547
component revision level: 5
***@weathergirl:~# smp_rep_manufacturer /dev/bsg/expander-6\:1
Report manufacturer response:
Expander change count: 67
SAS-1.1 format: 1
vendor identification: LSI CORP
product identification: SAS2X28
product revision level: 0717
component vendor identification: LSI
component id: 545
component revision level: 5

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Peter Chang

2011-08-17 17:08:09 UTC


Post by Fredrik Lindgren
When doing disk IO on the disks (they are all configured in MD raids)
suddenly IO will
stop and these messages are printed on the console about once every second:
mpt2sas0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
PL_LOGINFO_CODE_RESET (0x00110000)
PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET (0x00000600)
So a disk is acting up, generating errors? What does the last "10" mean in
the sub_code,
is that an identifier for which disk it is?

no, the bottom bits are still part of the error code.
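
To make the field boundaries concrete, here is a minimal sketch that splits a log_info word the same way the console message does. The layout (4-bit bus type, 4-bit originator, 8-bit code, 16-bit sub_code) is inferred from the driver's printed output quoted above, and the names decode_log_info, LogInfo and ORIGINATORS are made up for this illustration. It only separates the fields; it does not look up what a given sub_code means.

```python
from typing import NamedTuple

# Originator names as the driver prints them; anything else is reported
# as "unknown" in this sketch.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


class LogInfo(NamedTuple):
    bus_type: int
    originator: str
    code: int
    sub_code: int


def decode_log_info(value: int) -> LogInfo:
    """Split a 32-bit log_info word into its four fields (assumed layout)."""
    return LogInfo(
        bus_type=(value >> 28) & 0xF,
        originator=ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        code=(value >> 16) & 0xFF,
        sub_code=value & 0xFFFF,  # all 16 low bits, including the final "10"
    )


if __name__ == "__main__":
    for word in (0x31110610, 0x31111000):
        info = decode_log_info(word)
        print(f"{word:#010x}: originator({info.originator}), "
              f"code({info.code:#04x}), sub_code({info.sub_code:#06x})")
```

For 0x31110610 the sub_code field comes out as 0x0610 in one piece, so the trailing "10" belongs to the sub_code rather than naming a disk.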

i haven't run w/ your exact fw/driver setup, but i think you'll find
that you're in a 'loop' where the driver is returning DID_RESET and
the scsi layer is retrying w/o going through the retry counter logic
(the command that fails is one that the firmware issued).

\p

Peter Chang

2011-08-17 18:49:28 UTC


Post by Peter Chang
When doing disk IO on the disks (they are all configured in MD raids)
suddenly IO will
stop and these messages are printed on the console about once every second:
mpt2sas0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
PL_LOGINFO_CODE_RESET (0x00110000)
PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET (0x00000600)
So a disk is acting up, generating errors? What does the last "10" mean in
the sub_code,
is that an identifier for which disk it is?

since someone else gave the error code (i didn't check if i just had
some other magic header)...

the problem is probably a combination of the disk and controller
firmwares. when an NCQ request fails the firmware will do a READ LOG
EXT(10) to figure out why. some disks don't handle this sequence
the way the firmware expects so it starts the COMRESET dance w/ the
disk and returns an event w/ the loginfo to the driver/kernel.

the 'fix' (really a workaround) is in
mpt2sas_scsih.c:_scsih_io_done(). in the case for
MPI2_IOCSTATUS_SCSI_TASK_TERMINATED change the DID_RESET to
DID_SOFT_ERROR and the rest of the scsi layer will go down the regular
retry handling and you'll get out of the 'loop'.
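
A toy model of the 'loop' may help show why the host byte matters. This is not kernel code: it only assumes, as described above, that a DID_RESET completion is requeued without touching the retry counter while DID_SOFT_ERROR counts against it. The function name and the allowed count of 5 are made up for illustration.

```python
DID_RESET = "DID_RESET"
DID_SOFT_ERROR = "DID_SOFT_ERROR"


def attempts_until_giving_up(host_byte: str, allowed: int = 5, cap: int = 1000):
    """Return how many attempts the toy mid-layer makes before failing the
    command, or None if it is still retrying after `cap` attempts."""
    retries = 0
    for attempt in range(1, cap + 1):
        # The command fails on every attempt, as with the misbehaving disk.
        if host_byte == DID_SOFT_ERROR:
            retries += 1              # counted against the retry limit
            if retries > allowed:
                return attempt        # give up; the error goes to the caller
        # DID_RESET: requeued as-is, the counter never moves.
    return None


if __name__ == "__main__":
    print(DID_SOFT_ERROR, attempts_until_giving_up(DID_SOFT_ERROR))
    print(DID_RESET, attempts_until_giving_up(DID_RESET))
```

In this model the DID_SOFT_ERROR case ends after allowed + 1 attempts, at which point MD would see a failed IO, while the DID_RESET case never terminates on its own.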

lsi is supposed to have this fix coming soon.

disabling NCQ will 'fix' this as well.
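
The thread does not say how to disable NCQ. One possible approach, sketched here as an assumption, is to drop the device queue depth to 1 through the standard sysfs attribute /sys/block/<disk>/device/queue_depth. For libata-attached disks that is the usual way to turn NCQ off; whether the HBA firmware behaves the same way for SATA disks behind mpt2sas is not confirmed here. The helper name set_queue_depth and its sysfs_root parameter are made up for this sketch.

```python
from pathlib import Path


def set_queue_depth(disk: str, depth: int = 1, sysfs_root: str = "/sys") -> int:
    """Write `depth` to the disk's queue_depth attribute and return the
    previous value. Needs root when pointed at the real /sys."""
    attr = Path(sysfs_root) / "block" / disk / "device" / "queue_depth"
    previous = int(attr.read_text().strip())
    attr.write_text(f"{depth}\n")
    return previous


if __name__ == "__main__":
    # "sdb" is a placeholder; substitute the affected disk.
    old = set_queue_depth("sdb", 1)
    print(f"queue_depth changed from {old} to 1")
```

The setting does not survive a reboot, so it would have to be reapplied at boot if it turns out to help.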

\p

Ravi Shankar

2011-08-17 18:35:00 UTC


Post by Fredrik Lindgren
Hello,
I'm seeing something strange on a Supermicro 847E16-R1400. It has SAS
expanders with SATA disks behind them (Seagate Barracuda XT). The SAS card
is an LSI SAS9211-8i.
When doing disk IO on the disks (they are all configured in MD raids)
suddenly IO will
mpt2sas0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
PL_LOGINFO_CODE_RESET (0x00110000)
PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET (0x00000600)
So a disk is acting up, generating errors? What does the last "10" mean in
the sub_code,
is that an identifier for which disk it is?
mpt2sas0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)
Now the disk seems to have died completely?
PL_LOGINFO_CODE_RESET (0x00110000)
PL_LOGINFO_SUB_CODE_DSCVRY_SATA_INIT_TIMEOUT (0x00001000)

I think sub code (0x610) indicates "Error in SATA ReadLogExt SATA command",
and subsequently the disk drive failed to initialize (SATA initialization
timeout). Since you've connected through an expander, the link between the
disk and the expander should be actively transmitting FIS frames. You can
verify whether the disk link is up by checking the expander routing tables.

Reduce the link speed (from 6 to 3 Gb/s) between HBA, expander and disk,
and try disabling Native Command Queuing (NCQ) to see whether it helps.