Discussion:
[Bug 14831] New: mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
b***@bugzilla.kernel.org
2009-12-18 11:25:45 UTC
http://bugzilla.kernel.org/show_bug.cgi?id=14831

Summary: mptsas - Use of ATA command pass-through results in
unreliable operation - drive / controller resets
Product: SCSI Drivers
Version: 2.5
Kernel Version: 2.6.26 - 2.6.31
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Other
AssignedTo: scsi_drivers-***@kernel-bugs.osdl.org
ReportedBy: ***@seoss.co.uk
CC: ***@lsi.com
Regression: No


On Debian 2.6.26-2-amd64, and with mptsas 3.04.13 from scsi-misc-2.6.git, use
of ATA command pass-through on LSI SAS1068 and SAS1068E may result in:

. Device resets
. Device offline
. Controller offline (only observed on 2.6.26)

The problem seems to occur far more frequently with the SAS1068 (PCI version).

I haven't verified whether any data loss is occurring, but this does at least
seem to be a possibility.


For 2.6.26:

/lib/modules/2.6.26-2-amd64/kernel/drivers/message/fusion/mptsas.ko
version: 3.04.06
license: GPL
description: Fusion MPT SAS Host driver
author: LSI Corporation

.. and a couple of Western Digital SATA drives, I ran the following
command:

while true ; do smartctl -a /dev/sg0 > /dev/null ; done

After approx 45 minutes this happened:

kernel: [5060492.926757]
mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @602 - Controller
disabled.


For 2.6.32-rc4 with mptsas 3.04.13:

[ 22.414415] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id
9, phy 0, sas_addr 0x1221000000000000
[ 22.466953] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id
1, phy 1, sas_addr 0x1221000001000000
[ 22.519305] mptsas: ioc0: attaching raid volume, channel 1, id 0
[ 33.727405] Fusion MPT misc device (ioctl) driver 3.04.13
[ 33.738270] mptctl: Registered with Fusion MPT base driver
[ 33.749277] mptctl: /dev/mptctl @ (major,minor=10,220)
[ 5300.611795] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL},
Code={Reset}, SubCode(0x0d00)
[ 5300.629028] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL},
Code={Reset}, SubCode(0x0d00)
[ 5300.646254] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL},
Code={Reset}, SubCode(0x0d00)
[ 5300.663478] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL},
Code={Reset}, SubCode(0x0d00)
[ 5300.680700] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL},
Code={Reset}, SubCode(0x0d00)
[ 5300.697924] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL},
Code={Reset}, SubCode(0x0d00)
[ 5312.111795] mptbase: ioc0: LogInfo(0x31130000): Originator={PL},
Code={IO Not Yet Executed}, SubCode(0x0000)
[ 5312.131469] mptscsih: ioc0: attempting task abort! (sc=ffff88012c5fc8c0)
[ 5312.156831] mptscsih: ioc0: task abort: FAILED (sc=ffff88012c5fc8c0)
[ 5312.169534] mptscsih: ioc0: attempting target reset!
(sc=ffff88012c5fc8c0)
[ 5312.195222] mptscsih: ioc0: target reset: FAILED (sc=ffff88012c5fc8c0)
[ 5312.208276] mptscsih: ioc0: attempting bus reset! (sc=ffff88012c5fc8c0)
[ 5316.612245] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88012c5fc8c0)
[ 5328.112389] mptbase: ioc0: LogInfo(0x31140000): Originator={PL},
Code={IO Executed}, SubCode(0x0000)
[ 5328.128508] mptscsih: ioc0: attempting host reset! (sc=ffff88012c5fc8c0)
[12537.867482] mptbase: ioc0: LogInfo(0x31140000): Originator={PL},
Code={IO Executed}, SubCode(0x0000)
[12537.885769] mptscsih: ioc0: attempting host reset! (sc=ffff88012d55c8c0)
[12537.899173] mptbase: ioc0: Initiating recovery
[12559.704264] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88012d55c8c0)
[44184.424640] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL},
Code={Reset}, SubCode(0x0d00)
[44184.441866] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL},
Code={Reset}, SubCode(0x0d00)
[44195.924782] mptbase: ioc0: LogInfo(0x31130000): Originator={PL},
Code={IO Not Yet Executed}, SubCode(0x0000)
[44195.944449] mptscsih: ioc0: attempting task abort! (sc=ffff88012c403ac0)
[44195.969799] mptscsih: ioc0: task abort: FAILED (sc=ffff88012c403ac0)
[44195.982500] mptscsih: ioc0: attempting target reset!
(sc=ffff88012c403ac0)
[44196.008182] mptscsih: ioc0: target reset: FAILED (sc=ffff88012c403ac0)
[44196.021230] mptscsih: ioc0: attempting bus reset! (sc=ffff88012c403ac0)
[44200.425026] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88012c403ac0)
[44211.925127] mptbase: ioc0: LogInfo(0x31140000): Originator={PL},
Code={IO Executed}, SubCode(0x0000)
[44211.943416] mptscsih: ioc0: attempting host reset! (sc=ffff88012c403ac0)
[44211.956814] mptbase: ioc0: Initiating recovery
[44233.760010] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88012c403ac0)
[49878.447977] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL},
Code={Reset}, SubCode(0x0d00)
[49889.948381] mptbase: ioc0: LogInfo(0x31130000): Originator={PL},
Code={IO Not Yet Executed}, SubCode(0x0000)
[49889.968080] mptscsih: ioc0: attempting task abort! (sc=ffff88003799acc0)
[49889.993425] mptscsih: ioc0: task abort: FAILED (sc=ffff88003799acc0)
[49890.006129] mptscsih: ioc0: attempting target reset!
(sc=ffff88003799acc0)
[49890.031817] mptscsih: ioc0: target reset: FAILED (sc=ffff88003799acc0)
[49890.044869] mptscsih: ioc0: attempting bus reset! (sc=ffff88003799acc0)
[49894.448617] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88003799acc0)
[49905.948189] mptbase: ioc0: LogInfo(0x31140000): Originator={PL},
Code={IO Executed}, SubCode(0x0000)
[49905.966491] mptscsih: ioc0: attempting host reset! (sc=ffff88003799acc0)
[49905.979888] mptbase: ioc0: Initiating recovery
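The LogInfo words in the log above can also be split into fields by hand. What follows is a minimal sketch, assuming the SAS LogInfo layout of a 4-bit bus type, a 4-bit originator, an 8-bit code and a 16-bit subcode. The name tables hold only the values that mptbase printed in this thread, and the function name `decode_loginfo` is hypothetical.

```python
# Names limited to the originators and codes that appear in the logs
# quoted in this thread; anything else is reported as "unknown".
ORIGINATORS = {0x0: "IOP", 0x1: "PL"}
CODES = {
    ("IOP", 0x03): "Invalid Page",
    ("PL", 0x11): "Reset",
    ("PL", 0x13): "IO Not Yet Executed",
    ("PL", 0x14): "IO Executed",
}


def decode_loginfo(word):
    """Split a 32-bit LogInfo word into its assumed bit fields."""
    bus_type = (word >> 28) & 0xF      # bits 31-28 (3 in every word logged here)
    originator = (word >> 24) & 0xF    # bits 27-24
    code = (word >> 16) & 0xFF         # bits 23-16
    subcode = word & 0xFFFF            # bits 15-0
    orig_name = ORIGINATORS.get(originator, "unknown")
    return {
        "bus_type": bus_type,
        "originator": orig_name,
        "code": CODES.get((orig_name, code), "unknown"),
        "subcode": subcode,
    }


for w in (0x31110D00, 0x31130000, 0x31140000):
    info = decode_loginfo(w)
    print(
        f"0x{w:08x}: Originator={{{info['originator']}}}, "
        f"Code={{{info['code']}}}, SubCode(0x{info['subcode']:04x})"
    )
```

Read this way, each recovery sequence in the log starts with one or more PL "Reset" entries, followed by "IO Not Yet Executed" and then "IO Executed" as the error handler escalates.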
--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
b***@bugzilla.kernel.org
2009-12-18 12:44:30 UTC

--- Comment #1 from kdesai <***@lsi.com> 2009-12-18 12:44:28 ---

Can you please provide firmware version information?
--
b***@bugzilla.kernel.org
2009-12-18 15:18:39 UTC

--- Comment #2 from Tim Small <***@seoss.co.uk> 2009-12-18 15:18:37 ---
For the 1068 (Dell PE860):

    Port Name          Chip Vendor/Type/Rev     MPT Rev  Firmware Rev
 1. /proc/mpt/ioc0     LSI Logic SAS1068 B0       105      000a3300

Current active firmware version is 0.10.51
Firmware image's version is MPTFW-00.10.51.00-IE
LSI Logic
x86 BIOS image's version is MPTBIOS-6.12.05.00 (2007.09.29)


For the 1068E (Dell PE1950):
    Port Name          Chip Vendor/Type/Rev     MPT Rev  Firmware Rev
 1. /proc/mpt/ioc0     LSI Logic SAS1068E 08      105      00192f00

Current active firmware version is 0.25.47
Firmware image's version is MPTFW-00.25.47.00-IE
LSI Logic
x86 BIOS image's version is MPTBIOS-6.22.03.00 (2008.08.06)
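The hexadecimal "Firmware Rev" words and the dotted firmware versions quoted above appear to be the same value in two notations. A minimal sketch of the conversion, assuming one byte per version component (the component names and the function name `firmware_version` are assumptions, not taken from the tool that printed the listing):

```python
def firmware_version(fw_rev_hex):
    """Render a 32-bit Firmware Rev word as a dotted version string.

    Assumes one byte per component, most significant byte first.
    """
    word = int(fw_rev_hex, 16)
    major = (word >> 24) & 0xFF
    minor = (word >> 16) & 0xFF
    unit = (word >> 8) & 0xFF
    dev = word & 0xFF
    return f"{major:02d}.{minor:02d}.{unit:02d}.{dev:02d}"


print(firmware_version("000a3300"))  # SAS1068 in the PE860
print(firmware_version("00192f00"))  # SAS1068E in the PE1950
```

The two results line up with the MPTFW-00.10.51.00 and MPTFW-00.25.47.00 image names reported for the same controllers.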


I've just started a test on a SAS1064 (Intel S5000PSL). Will leave it on test
and report back here...

Thanks,

Tim.
--
b***@bugzilla.kernel.org
2009-12-18 15:31:56 UTC

--- Comment #3 from Tim Small <***@seoss.co.uk> 2009-12-18 15:31:55 ---
Hmm, failed as soon as I submitted the last comment (Intel S5000PSL on-board
controller)...

filename:
/lib/modules/2.6.26-2-amd64/kernel/drivers/message/fusion/mptsas.ko
version: 3.04.06

    Port Name          Chip Vendor/Type/Rev     MPT Rev  Firmware Rev
 1. /proc/mpt/ioc0     LSI Logic SAS1064E 04      105      01190100


Current active firmware version is 1.25.01
Firmware image's version is MPTFW-01.25.01.00-IT
LSI Logic
x86 BIOS image's version is MPTBIOS-6.22.00.00 (2008.04.10)


[ 2.369101] ioc0: LSISAS1064E B2: Capabilities={Initiator}
[ 2.371062] mptbase: ioc0: PCI-MSI enabled
[ 2.371612] PCI: Setting latency timer of device 0000:04:00.0 to 64
[ 18.377426] scsi0 : ioc0: LSISAS1064E B2, FwRev=01190100h, Ports=1,
MaxQ=478, IRQ=1269
[ 19.249357] scsi 0:0:0:0: Direct-Access ATA WDC WD3200BJKT-0 1A11
PQ: 0 ANSI: 5
[ 881.982165] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP},
Code={Invalid Page}, SubCode(0x0108)
[ 882.086359] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP},
Code={Invalid Page}, SubCode(0x0108)
[ 1514.521445] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP},
Code={Invalid Page}, SubCode(0x0108)
[ 1514.525947] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP},
Code={Invalid Page}, SubCode(0x0108)
[ 2051.568333] mptscsih: ioc0: attempting task abort! (sc=ffff8101190f6940)
[ 2051.568446] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00
00 00 01 00 00 00 00 00 00 00 ec 00
[ 2056.593202] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO
Executed}, SubCode(0x0000)
[ 2056.594064] mptsas: ioc0: removing sata device, channel 0, id 0, phy 0
[ 2056.594182] port-0:0: mptsas: ioc0: delete port (0)
[ 2056.617086] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 2056.797030] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8101190f6940)
[ 2056.797166] mptscsih: ioc0: attempting task abort! (sc=ffff810124da15c0)
[ 2056.798195] sd 0:0:0:0: [sda] CDB: Synchronize Cache(10): 35 00 00 00 00 00
00 00 00 00
[ 2056.799567] mptscsih: ioc0: task abort: SUCCESS (sc=ffff810124da15c0)
[ 2056.799697] mptscsih: ioc0: attempting target reset! (sc=ffff8101190f6940)
[ 2056.799821] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00
00 00 01 00 00 00 00 00 00 00 ec 00
[ 2057.137289] mptscsih: ioc0: target reset: SUCCESS (sc=ffff8101190f6940)
[ 2057.140469] mptscsih: ioc0: attempting bus reset! (sc=ffff8101190f6940)
[ 2057.140585] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00
00 00 01 00 00 00 00 00 00 00 ec 00
[ 2057.398218] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff8101190f6940)
[ 2068.601852] mptscsih: ioc0: attempting host reset! (sc=ffff8101190f6940)
[ 2068.606217] mptbase: ioc0: Initiating recovery
[ 2082.235819] mptscsih: ioc0: host reset: SUCCESS (sc=ffff8101190f6940)
[ 2082.235932] sd 0:0:0:0: Device offlined - not ready after error recovery
[ 2082.236054] sd 0:0:0:0: Device offlined - not ready after error recovery
[ 2082.236197] end_request: I/O error, dev sda, sector 19534911
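The CDB that times out in the log above can be decoded to see which ATA command was being passed through. This is a minimal sketch, assuming the SAT field layout for ATA PASS-THROUGH (16); the function name `decode_ata_pass_through_16` is hypothetical, and only the fields needed to read this particular CDB are extracted.

```python
def decode_ata_pass_through_16(cdb_hex):
    """Decode selected fields of an ATA PASS-THROUGH (16) CDB.

    cdb_hex is the space-separated byte string as printed by the kernel.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0xF,   # 4 = PIO Data-In
        "extend": cdb[1] & 0x1,            # 0 = 28-bit command
        "ck_cond": (cdb[2] >> 5) & 0x1,
        "t_dir": (cdb[2] >> 3) & 0x1,      # 1 = transfer from the device
        "byt_blok": (cdb[2] >> 2) & 0x1,   # 1 = length counted in blocks
        "t_length": cdb[2] & 0x3,          # 2 = length is in the sector count field
        "sector_count": cdb[6],
        "device": cdb[13],
        "command": cdb[14],                # 0xEC = IDENTIFY DEVICE
    }


fields = decode_ata_pass_through_16(
    "85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00"
)
print(fields)
```

Under these assumptions the failing command is a one-sector PIO Data-In transfer with ATA command 0xEC (IDENTIFY DEVICE).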
--
b***@bugzilla.kernel.org
2009-12-18 15:32:39 UTC

Tim Small <***@seoss.co.uk> changed:

What          |Removed         |Added
----------------------------------------------------------------------------
Kernel Version|2.6.26 - 2.6.31 |2.6.26 - 2.6.32rc4-scsi-misc
--
b***@bugzilla.kernel.org
2009-12-21 04:51:49 UTC

--- Comment #4 from kdesai <***@lsi.com> 2009-12-21 04:51:47 ---
I tried the same test with the setup detailed below, and it works fine for me.
Could you try upgrading the firmware on the 1068 B0 card mentioned in comment #2?

I used a 1068 B0 card; the HDDs are Western Digital SATA drives, REV: 1E01
FW version is 1.29.00.00-IE
Card Name is SAS3442X.

The MPT driver version is 3.4.14 (the latest upstream driver). See the
attachment fusion_03_04_14.tgz for quick access to the fusion driver. You may
need to change some code to make it compile with your kernel.


Please give it a try and let me know your results.

The 1.29.00 firmware is available at
http://lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/combo/sas3442x-r/index.html

Thanks,
Kashyap
--
b***@bugzilla.kernel.org
2009-12-21 04:52:57 UTC

--- Comment #5 from kdesai <***@lsi.com> 2009-12-21 04:52:56 ---
Created an attachment (id=24240)
--> (http://bugzilla.kernel.org/attachment.cgi?id=24240)
latest upstream fusion driver 3.4.14
--
b***@bugzilla.kernel.org
2009-12-21 12:08:52 UTC

--- Comment #6 from Tim Small <***@seoss.co.uk> 2009-12-21 12:08:50 ---
Hi Kashyap,

Thanks for your input. Unfortunately, I can't test on the 1068, as the machine
is now in production (with SMART disabled!).

I do still have access to the 1068E and the 1064, and I will see if I can
borrow another 1068.

Could you try the attached script on your test system? It carries out I/O to
the device which is under test, and seems to trigger failures much more quickly
as a result.

Thanks,

Tim.
--
b***@bugzilla.kernel.org
2009-12-21 12:11:32 UTC

--- Comment #7 from Tim Small <***@seoss.co.uk> 2009-12-21 12:11:31 ---
Created an attachment (id=24243)
--> (http://bugzilla.kernel.org/attachment.cgi?id=24243)
Script to stress-test ATA command passthrough whilst write-loading a SATA
device.

This script uses dd to repeatedly write and remove a 1G zero-filled file
to/from a file-system whilst executing smartctl against the associated device
file.
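The attachment itself is not reproduced in this thread. The following is only a rough sketch of the approach described above, not the attached script: one thread repeatedly writes and removes a 1 GiB zero-filled file with dd while the main thread runs `smartctl -a` against the device. All function names, and the device and file paths in the usage note, are hypothetical.

```python
import os
import subprocess
import threading


def dd_command(target_file, size_mib=1024):
    """Command line that writes size_mib MiB of zeros to target_file."""
    return ["dd", "if=/dev/zero", f"of={target_file}", "bs=1M", f"count={size_mib}"]


def smartctl_command(device):
    """Command line equivalent to the smartctl call in the original report."""
    return ["smartctl", "-a", device]


def write_load(target_file, stop):
    """Write and remove the file until asked to stop."""
    while not stop.is_set():
        subprocess.run(dd_command(target_file), check=True, stderr=subprocess.DEVNULL)
        os.remove(target_file)


def stress(device, target_file, iterations):
    """Run smartctl `iterations` times while the write load is active."""
    stop = threading.Event()
    writer = threading.Thread(target=write_load, args=(target_file, stop))
    writer.start()
    try:
        for _ in range(iterations):
            # smartctl's exit status is ignored on purpose, as in the
            # original one-line loop; only the kernel log is of interest.
            subprocess.run(smartctl_command(device), stdout=subprocess.DEVNULL)
    finally:
        stop.set()
        writer.join()
```

A call such as `stress("/dev/sda", "/mnt/test/zero.bin", 1000)`, run as root with the file on a file system that lives on the device under test, would exercise pass-through and write I/O at the same time; both paths here are placeholders.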
--
b***@bugzilla.kernel.org
2010-01-11 11:59:42 UTC

--- Comment #8 from amf <***@ukdedicated.com> 2010-01-11 11:59:39 ---
We see this on numerous Dell hosts running the SAS6iR, which is based on the
LSISAS1068E chip, with the stock RHEL5 driver.

It's simple to reproduce with SMART commands, but we actually see huge issues
with drives dropping off cards even during heavy I/O and no SMART commands
involved at all. It seems to be all the worse when the disk reallocates a bad
sector. The disks are Dell-supplied and therefore 'enterprise' models capable
of TLER type behaviour.

I believe the SMART command method makes it easier to reproduce what may be a
problem not specifically related to SMART, but that's just my own feeling.

HTH.
--
b***@bugzilla.kernel.org
2010-01-12 23:15:41 UTC

Aaron Williams <***@gmail.com> changed:

What          |Removed         |Added
----------------------------------------------------------------------------
CC            |                |***@gmail.com

--- Comment #9 from Aaron Williams <***@gmail.com> 2010-01-12 23:15:37 ---
I am seeing similar events with my current computer and with my last computer.
My setup consists of two WD Black Edition 1TB drives:

Model=WDC WD1001FALS-00J7B0, FwRev=05.00K05, SerialNo=WD-WMATV0705568
Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=no WriteCache=enabled
Drive conforms to: Unspecified: ATA/ATAPI-1,2,3,4,5,6,7

00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI
Controller

I also have smartd running to periodically query the drives. The two drives run
in a mirrored configuration, and this arbitrarily kicks one of them out of my
RAID array (using md). I just spent the last few days recovering from a RAID
event that killed the entire array (due to this problem, I believe).
--
b***@bugzilla.kernel.org
2010-01-13 12:30:50 UTC

--- Comment #10 from Tim Small <***@seoss.co.uk> 2010-01-13 12:30:47 ---
Hi Aaron,

It's possible that this is an unrelated issue. On one of the systems with the
MPT SAS controllers, I have moved a drive from an MPT SAS channel onto an Intel
631xESB/632xESB SATA channel, and the unreliable behaviour appeared to stop.

Do you have any other drives which you can test in place of the WD drives?
Personally I have found Hitachi SATA drives to be well engineered in recent
years from a SMART PoV.

If you'd like to open another bug, the script included in this bug might help
you reproduce the problems. You could also try disabling NCQ and/or using a
different SATA controller (Silicon Image SiI 3132 based PCIe cards are
available very inexpensively) to see if this helps.
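For the NCQ suggestion, one common way to switch NCQ off on a libata-driven disk is to set its queue depth to 1 through sysfs. A minimal sketch, assuming the usual `/sys/block/<disk>/device/queue_depth` attribute; the function names are hypothetical, and `sysfs_root` is a parameter only so the helper can be tried against a scratch directory.

```python
from pathlib import Path


def queue_depth_path(disk, sysfs_root="/sys"):
    """Path of the queue_depth attribute for a disk such as 'sda'."""
    return Path(sysfs_root) / "block" / disk / "device" / "queue_depth"


def set_queue_depth(disk, depth, sysfs_root="/sys"):
    """Write a new queue depth and return the previous value.

    A depth of 1 leaves only one command outstanding at a time, which
    effectively disables NCQ for that disk.
    """
    path = queue_depth_path(disk, sysfs_root)
    previous = int(path.read_text().strip())
    path.write_text(f"{depth}\n")
    return previous
```

Calling `set_queue_depth("sda", 1)` as root, with `sda` standing in for the affected drive, and keeping the returned value makes it easy to restore the original depth after the test.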

Thanks,

Tim.