megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
Robin H. Johnson
2014-07-12 11:56:52 UTC
TL;DR: The LSI2208 card faults out and does not bring up its drives under Linux. It works fine in the BIOS.
The driver has no debug interfaces visible in the code for early startup.

Hardware: Supermicro SSG-6027R-E1R12T
http://www.supermicro.com/products/system/2U/6027/SSG-6027R-E1R12T.cfm
Motherboard is X9DRH-7TF
Contains an LSI2208 controller (megaraid_sas), which is this bug.

I also have an LSI2008 (mpt2sas) card in a PCIe slot for accessing an external
tape library; that one works fine [it's in CPU2-SLOT6, PCIe v3 x8].

01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] [1000:005b] (rev 05)
82:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
(full lspci output further down)

Whenever the megaraid_sas module loads, it fails out :-(.
[ 14.188561] megasas: 06.803.01.00-rc1 Mon. Mar. 10 17:00:00 PDT 2014
[ 14.188577] megasas: 0x1000:0x005b:0x15d9:0x0690: bus 1:slot 0:func 0
[ 14.188584] megaraid_sas 0000:01:00.0: enabling device (0000 -> 0002)
[ 14.188735] megasas: Waiting for FW to come to ready state
[ 14.193999] megasas: FW in FAULT state!!
[ 14.194003] megaraid_sas 0000:01:00.0: megasas: FW restarted successfully from megasas_init_fw!
[ 44.210482] megasas: Waiting for FW to come to ready state
[ 44.210484] megasas: FW in FAULT state!!
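
With no early-startup debug interface in the driver, the log timestamps are about the only data to go on. Here is a minimal Python sketch that pulls the FAULT messages out of a dmesg capture and prints the gap between the first attempt and the retry (about 30 seconds in the log above). The name `fault_times` and the regular expression are illustrative only, not part of the driver or any LSI tool.

```python
import re

# Matches a dmesg line such as "[ 14.193999] megasas: FW in FAULT state!!".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Excerpt of the log quoted above.
SAMPLE = """\
[ 14.188735] megasas: Waiting for FW to come to ready state
[ 14.193999] megasas: FW in FAULT state!!
[ 14.194003] megaraid_sas 0000:01:00.0: megasas: FW restarted successfully from megasas_init_fw!
[ 44.210482] megasas: Waiting for FW to come to ready state
[ 44.210484] megasas: FW in FAULT state!!
"""


def fault_times(dmesg_text):
    """Return the timestamps (seconds since boot) of every FAULT-state message."""
    times = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and "FW in FAULT state" in match.group(2):
            times.append(float(match.group(1)))
    return times


if __name__ == "__main__":
    times = fault_times(SAMPLE)
    # Print the delay between each fault and the next one.
    for earlier, later in zip(times, times[1:]):
        print(f"fault at {earlier:.6f}s, next fault {later - earlier:.3f}s later")
```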

While the system boots (in the BIOS), the card DOES cleanly probe the drives
(6x ST32000641AS) and has them assembled into a RAID6.

The problem occurs in all of these kernels:
Ubuntu 3.13.11.2 (3.13.0-30.55-generic)
Vanilla 3.14.5
Ubuntu 3.16.0-rc4 (3.16.0-3.8~14.10-generic sic) from ppa:canonical-kernel-team/ppa
(quite willing to build custom kernels for testing, I just had these on hand
for quick reboots).

If you Google around for the problem, there were claims that it's related to
bug BKO63661 (https://bugzilla.kernel.org/show_bug.cgi?id=63661), amongst other things, suggesting the following workarounds:
pci=conf1
pcie_aspm=off
disable_msi=1
None of these has any effect.

# lspci -nn -d 1000: -vvxxx
01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] [1000:005b] (rev 05)
Subsystem: Super Micro Computer Inc LSI MegaRAID ROMB [15d9:0690]
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 16
Region 0: I/O ports at 8000 [disabled] [size=256]
Region 1: Memory at dfe60000 (64-bit, non-prefetchable) [size=16K]
Region 3: Memory at dfe00000 (64-bit, non-prefetchable) [size=256K]
Expansion ROM at dfe40000 [disabled] [size=128K]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
Capabilities: [d0] Vital Product Data
pcilib: sysfs_read_vpd: read failed: Connection timed out
Not readable
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [c0] MSI-X: Enable- Count=16 Masked-
Vector table: BAR=1 offset=00002000
PBA: BAR=1 offset=00003000
00: 00 10 5b 00 02 00 10 00 05 00 04 01 10 00 00 00
10: 01 80 00 00 04 00 e6 df 00 00 00 00 04 00 e0 df
20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 90 06
30: 00 00 e4 df 50 00 00 00 00 00 00 00 0b 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 01 00 00 10 d0 02 00 25 80 00 10
70: 20 28 00 00 83 04 40 00 40 00 83 10 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
90: 00 00 00 00 0e 00 00 00 03 00 3e 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 11 00 0f 00 01 20 00 00 01 30 00 00 00 00 00 00
d0: 03 a8 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

82:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
Subsystem: Dell 6Gbps SAS HBA Adapter [1028:1f1c]
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 11
Region 0: I/O ports at f000 [disabled] [size=256]
Region 1: Memory at fbe40000 (64-bit, non-prefetchable) [disabled] [size=64K]
Region 3: Memory at fbe00000 (64-bit, non-prefetchable) [disabled] [size=256K]
Expansion ROM at fbd00000 [disabled] [size=1M]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [d0] Vital Product Data
Unknown small resource type 00, will not decode more.
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [c0] MSI-X: Enable- Count=15 Masked-
Vector table: BAR=1 offset=0000e000
PBA: BAR=1 offset=0000f800
00: 00 10 72 00 00 00 10 00 03 00 07 01 10 00 00 00
10: 01 f0 00 00 04 00 e4 fb 00 00 00 00 04 00 e0 fb
20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 1c 1f
30: 00 00 d0 fb 50 00 00 00 00 00 00 00 0b 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 82 00 00 10 d0 02 00 25 80 00 10
70: 20 28 09 00 82 04 00 00 40 00 82 10 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
90: 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 11 00 0e 00 01 e0 00 00 01 f8 00 00 00 00 00 00
d0: 03 a8 00 80 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
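
One thing the raw dump shows directly is the PCI Command register, which lspci prints as the Control: line. A hedged Python sketch that decodes bytes 0x04-0x05 of the `00:` row follows. The helper names are illustrative and not from pciutils; the bit positions are the standard PCI Command register layout.

```python
# Bit positions of the PCI Command register (config-space offset 0x04),
# paired with the labels lspci prints on its "Control:" line.
COMMAND_BITS = [
    (0, "I/O"),
    (1, "Mem"),
    (2, "BusMaster"),
    (3, "SpecCycle"),
    (4, "MemWINV"),
    (5, "VGASnoop"),
    (6, "ParErr"),
    (7, "Stepping"),
    (8, "SERR"),
    (9, "FastB2B"),
    (10, "DisINTx"),
]


def parse_hex_row(row):
    """Split one 'lspci -x' row such as '00: 00 10 5b ...' into (offset, bytes)."""
    offset_text, _, data_text = row.partition(":")
    return int(offset_text, 16), bytes(int(b, 16) for b in data_text.split())


def decode_command(row):
    """Return the lspci-style flag string for the Command register in row '00:'."""
    offset, data = parse_hex_row(row)
    if offset != 0:
        raise ValueError("the Command register lives in the row starting at 00:")
    # The register is 16 bits wide and stored little-endian at offset 0x04.
    command = int.from_bytes(data[4:6], "little")
    return " ".join(
        f"{name}{'+' if command & (1 << bit) else '-'}" for bit, name in COMMAND_BITS
    )


if __name__ == "__main__":
    # The 00: row of the 01:00.0 (SAS2208) dump above.
    print(decode_command("00: 00 10 5b 00 02 00 10 00 05 00 04 01 10 00 00 00"))
```

For the 01:00.0 row this reproduces the Control: line above: only Mem+ is set, with I/O and BusMaster both off.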
--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail : ***@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
Bjorn Helgaas
2014-07-12 17:29:20 UTC
[+cc Matthew]
Thanks for the report, Robin.

https://bugzilla.kernel.org/show_bug.cgi?id=63661 bisected the problem
to 3c076351c402 ("PCI: Rework ASPM disable code"), which appeared in
v3.3. For starters, can you verify that, e.g., by building
69166fbf02c7 (the parent of 3c076351c402) to make sure that it works,
and building 3c076351c402 itself to make sure it fails?

Assuming that's the case, please attach the complete dmesg and "lspci
-vvxxx" output for both kernels to the bugzilla. ASPM is a feature
that is configured on both ends of a PCIe link, so I want to see the
lspci info for the whole system, not just the SAS adapters.

It's not practical to revert 3c076351c402 now, so I'd also like to see
the same information for the newest possible kernel (if this is
possible; I'm not clear on whether you can boot your system or not) so
we can figure out what needs to be changed.

Bjorn
Robin H. Johnson
2014-07-13 01:35:51 UTC
TL;DR: FastBoot leaves the MegaRAID SAS controller in a weird state, and it
fails to start. Commit 3c076351c402 did make it worse, but I think we're right
that the bug lies in the SAS code.

Ok, I have done more testing on it (40+ boots), and I think we can show the
problem is somewhere in how the BIOS/EFI/ROM brings up the card in FastBoot
mode, or how it leaves the card.

A full boot of the system was difficult on the 3.2 kernels: they didn't make it
to userspace because the rest of the system is too new for them. For testing, I
compiled CONFIG_MEGARAID_SAS=y on 3.2, and =m on 3.16-rc4; that way, when the
initramfs & userspace failed, the megaraid load was still captured over IPMI serial.

I've done a lot of the analysis below while capturing.

I was going to be booting many times, so I flipped the 'Fast Boot'
option back to Disabled, so I could more easily get to the BIOS settings
to change options while testing. When I did so, an accidental boot on a
kernel that previously failed suddenly worked, leading me to raise an
eyebrow, and this expanded my test matrix more.

3 kernels, 6 different BIOS config combinations (2x3) = 18 test cases
Each configuration was booted at least twice; if the result of two boots was
not identical, I booted a third time and took the majority result.
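
The two-boots-then-tiebreak rule described above can be sketched in Python as follows. `boot_once` is a hypothetical stand-in for "reboot and check whether megaraid_sas came up"; nothing here automates real reboots.

```python
def settled_result(boot_once):
    """Apply the two-boots-then-tiebreak rule to one configuration.

    boot_once is a hypothetical callable that boots the machine once and
    returns True if megaraid_sas came up (PASS) or False if it faulted (FAIL).
    """
    outcomes = [boot_once(), boot_once()]
    if outcomes[0] != outcomes[1]:
        # The first two boots disagree, so a third boot breaks the tie.
        outcomes.append(boot_once())
    passes = sum(outcomes)
    return "PASS" if passes * 2 > len(outcomes) else "FAIL"


if __name__ == "__main__":
    # Simulated boot results stand in for real reboots here.
    simulated = iter([True, False, False])
    print(settled_result(lambda: next(simulated)))
```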

All kernels had no boot params involving PCI specified (none of pci=, pcie*=,
disable_msi*).

Kernels:
K.1: Ubuntu's 3.16-rc4
K.2: 3.2-rc4 3c076351c402 - aspm merged
K.3: 3.2-rc4 69166fbf02c7 - aspm merge parent
Notes: 3.2* compiled with GCC4.6, 3.16-rc4 with GCC4.8

BIOS: Boot -> FastBoot:
B1.1 Off
B1.2 On (CMOS reset default)

BIOS: Advanced -> PCIe/PCI/PnP Configuration -> ASPM Support
B2.1 Force L0s
B2.2 BIOS (CMOS reset default)
B2.3 Disabled

Reduced Karnaugh map of results:
Kernel  B1    B2    Result
*       B1.1  *     PASS
*       B1.2  B2.1  VARIABLE (9 runs: 5 fail, 4 pass, no kernel consistency)
K.1     B1.2  B2.2  FAIL
K.1     B1.2  B2.3  FAIL
K.2     B1.2  B2.2  FAIL
K.2     B1.2  B2.3  FAIL
K.3     B1.2  B2.2  PASS
K.3     B1.2  B2.3  PASS
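
The reduced table above can be expanded back into explicit test cases, both to check that it covers all 18 combinations and to see which setting separates PASS from FAIL. A small Python sketch; the dictionary layout and function names are assumptions made for illustration, and the outcomes are transcribed from the table.

```python
from collections import defaultdict

KERNELS = ("K.1", "K.2", "K.3")
ASPM = ("B2.1", "B2.2", "B2.3")


def expand_results():
    """Expand the reduced table into one entry per (kernel, FastBoot, ASPM)."""
    results = {}
    for kernel in KERNELS:
        for aspm in ASPM:
            # Row "*, B1.1, *": FastBoot off passes everywhere.
            results[(kernel, "B1.1", aspm)] = "PASS"
        # Row "*, B1.2, B2.1": FastBoot on with Force L0s is inconsistent.
        results[(kernel, "B1.2", "B2.1")] = "VARIABLE"
    for aspm in ("B2.2", "B2.3"):
        results[("K.1", "B1.2", aspm)] = "FAIL"
        results[("K.2", "B1.2", aspm)] = "FAIL"
        results[("K.3", "B1.2", aspm)] = "PASS"
    return results


def outcomes_by(results, index):
    """Group the set of observed outcomes by one key position (0, 1 or 2)."""
    grouped = defaultdict(set)
    for key, outcome in results.items():
        grouped[key[index]].add(outcome)
    return dict(grouped)


if __name__ == "__main__":
    results = expand_results()
    print(len(results), "test cases")
    for fastboot, outcomes in sorted(outcomes_by(results, 1).items()):
        print(fastboot, sorted(outcomes))
```

Grouping by the FastBoot column shows that B1.1 (Off) is the only setting that passes in every case.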

Here's the DMI info:
Motherboard: X9DRH-7TF/7F/iTF/iF
Version: 3.0b
Release Date: 04/28/2014

Recall also I said I had two LSI cards in here?
SAS2008 (in a slot) and SAS2208 (onboard)

Regardless of the BIOS settings, the SAS2008 card continues to work, even when
its I/O region 0 is marked as disabled. So is there some other initialization
work needed on the SAS2208 card so that it works in all cases?

The case of FastBoot=on, ASPM=ForceL0s is the interesting one, and the
lspci outputs compare nicely. The only trimming in the diff below is the removal
of context for other devices (no changes).

This does also look functionally identical between 3c076351c402 and 69166fbf02c7.

Full lspci & dmesg for the working+broken 3.16-rc4 boots are attached.

-lspci.1405201451.ASPM=L0s.FastBoot.no.kparams = 3.16-rc4, working
+lspci.1405201693.ASPM=L0s.FastBoot.no.kparams = 3.16-rc4, broken
# diff -Nar lspci.1405201451.ASPM=L0s.FastBoot.no.kparams lspci.1405201693.ASPM=L0s.FastBoot.no.kparams -I '^[0-9a-f][0-9a-f]:' -F rev -U15
--- lspci.1405201451.ASPM=L0s.FastBoot.no.kparams 2014-07-12 21:44:11.243897367 +0000
+++ lspci.1405201693.ASPM=L0s.FastBoot.no.kparams 2014-07-12 21:48:13.866860888 +0000
@@ -1157,95 +1157,93 @@ 00:1f.6 Signal processing controller [11
(trim other device, no changes)
01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] [1000:005b] (rev 05)
Subsystem: Super Micro Computer Inc LSI MegaRAID ROMB [15d9:0690]
- Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
+ Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
- Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
- Region 0: I/O ports at 8000 [size=256]
+ Region 0: I/O ports at 8000 [disabled] [size=256]
Region 1: Memory at dfe60000 (64-bit, non-prefetchable) [size=16K]
Region 3: Memory at dfe00000 (64-bit, non-prefetchable) [size=256K]
Expansion ROM at dfe40000 [disabled] [size=128K]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
Capabilities: [d0] Vital Product Data
- Unknown small resource type 00, will not decode more.
+ Not readable
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
- Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
+ Capabilities: [c0] MSI-X: Enable- Count=16 Masked-
Vector table: BAR=1 offset=00002000
PBA: BAR=1 offset=00003000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [1e0 v1] #19
Capabilities: [1c0 v1] Power Budgeting <?>
Capabilities: [190 v1] #16
Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
- Kernel driver in use: megaraid_sas
-00: 00 10 5b 00 07 04 10 00 05 00 04 01 10 00 00 00
+00: 00 10 5b 00 02 00 10 00 05 00 04 01 10 00 00 00
10: 01 80 00 00 04 00 e6 df 00 00 00 00 04 00 e0 df
20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 90 06
30: 00 00 e4 df 50 00 00 00 00 00 00 00 0b 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 01 00 00 10 d0 02 00 25 80 00 10
70: 20 28 00 00 83 04 40 00 40 00 83 10 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
90: 00 00 00 00 0e 00 00 00 03 00 3e 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-c0: 11 00 0f 80 01 20 00 00 01 30 00 00 00 00 00 00
-d0: 03 a8 00 80 00 00 00 00 00 00 00 00 00 00 00 00
+c0: 11 00 0f 00 01 20 00 00 01 30 00 00 00 00 00 00
+d0: 03 a8 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

(trim other device, no changes)
@@ -3049,35 +3047,35 @@ 80:05.4 PIC [0800]: Intel Corporation Xe
(trim other device, no changes)
82:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
Subsystem: Dell 6Gbps SAS HBA Adapter [1028:1f1c]
- Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
+ Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 56
- Region 0: I/O ports at f000 [size=256]
+ Region 0: I/O ports at f000 [disabled] [size=256]
Region 1: Memory at fbe40000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at fbe00000 (64-bit, non-prefetchable) [size=256K]
Expansion ROM at fbd00000 [disabled] [size=1M]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [d0] Vital Product Data
Unknown small resource type 00, will not decode more.
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [c0] MSI-X: Enable+ Count=15 Masked-
Vector table: BAR=1 offset=0000e000
PBA: BAR=1 offset=0000f800
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [138 v1] Power Budgeting <?>
Kernel driver in use: mpt2sas
-00: 00 10 72 00 07 04 10 00 03 00 07 01 10 00 00 00
+00: 00 10 72 00 06 04 10 00 03 00 07 01 10 00 00 00
10: 01 f0 00 00 04 00 e4 fb 00 00 00 00 04 00 e0 fb
20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 1c 1f
30: 00 00 d0 fb 50 00 00 00 00 00 00 00 0b 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 82 00 00 10 d0 02 00 25 80 00 10
70: 2f 28 09 00 82 04 00 00 40 00 82 10 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
90: 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 11 00 0e 80 01 e0 00 00 01 f8 00 00 00 00 00 00
d0: 03 a8 00 80 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

(trim other device, no changes)
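
The `c0:` rows in the diff above are where MSI-X shows as enabled on the working boot and disabled on the broken one. A minimal Python sketch that decodes the MSI-X Message Control word (bytes 2-3 after the capability header) from such a row follows. The function name is illustrative; the bit layout is the standard MSI-X capability (bit 15 enable, bit 14 function mask, bits 10:0 table size minus one).

```python
def decode_msix_control(row):
    """Decode the MSI-X Message Control word from an 'lspci -x' row.

    The row must start at the capability header, e.g. the 'c0:' row for a
    device whose lspci output says 'Capabilities: [c0] MSI-X'.
    """
    _, _, data_text = row.partition(":")
    data = bytes(int(b, 16) for b in data_text.split())
    if data[0] != 0x11:
        raise ValueError("capability ID is not 0x11 (MSI-X)")
    # Message Control is a 16-bit little-endian word at header offset 2.
    control = int.from_bytes(data[2:4], "little")
    return {
        "enabled": bool(control & 0x8000),          # bit 15: MSI-X Enable
        "function_masked": bool(control & 0x4000),  # bit 14: Function Mask
        "table_size": (control & 0x07FF) + 1,       # bits 10:0 hold N-1
    }


if __name__ == "__main__":
    working = "c0: 11 00 0f 80 01 20 00 00 01 30 00 00 00 00 00 00"
    broken = "c0: 11 00 0f 00 01 20 00 00 01 30 00 00 00 00 00 00"
    print("working:", decode_msix_control(working))
    print("broken: ", decode_msix_control(broken))
```

Both rows report a 16-entry table; only the enable bit differs, which matches the Enable+ / Enable- lines lspci printed for 01:00.0.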
--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail : ***@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85