Discussion:
[dm-devel] [PATCH 1/1] multipath-tools: Change path checker for IBM IPR devices
Christoph Hellwig
2014-09-25 16:57:43 UTC
The issue we've run into started when this patch started making its
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/scsi_error.c?id=14216561e164671ce147458653b1fea06a4ada1e
That changed the behaviour for user initiated TUR commands. After an ipr
adapter gets reset, all disk array devices require a start unit command
to be issued to them before they will accept commands. So, with the SCSI
EH change, we now end up in a scenario with dual ipr adapters where the
TUR getting issued from the health checker returns with a Not Ready response
and since SCSI EH no longer triggers the Start Unit in this scenario,
the path never recovers.
The alternative solution would be to change the TUR path checker in multipath-tools
to issue a Start Unit if it sees a 02/04/02.
Or we could fix up the check introduced by the commit, with something
ala:

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index a2c3d3d..7228d9e 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -459,13 +459,18 @@ static int scsi_check_sense(struct scsi_cmnd *scmd)
 	if (! scsi_command_normalize_sense(scmd, &sshdr))
 		return FAILED;	/* no valid sense data */

-	if (scmd->cmnd[0] == TEST_UNIT_READY && scmd->scsi_done != scsi_eh_done)
+	if (scmd->cmnd[0] == TEST_UNIT_READY &&
+	    scmd->request->cmd_type == REQ_TYPE_FS &&
+	    scmd->scsi_done != scsi_eh_done) {
 		/*
 		 * nasty: for mid-layer issued TURs, we need to return the
 		 * actual sense data without any recovery attempt. For eh
-		 * issued ones, we need to try to recover and interpret
+		 * issued ones, we need to try to recover and interpret,
+		 * and for pass through TURs we just need to stay out of the
+		 * way, so that the device handlers can do the right thing.
 		 */
 		return SUCCESS;
+	}

 	scsi_report_sense(sdev, &sshdr);
Thanks,
Brian
--
Brian King
Power Linux I/O
IBM Linux Technology Center
--
dm-devel mailing list
https://www.redhat.com/mailman/listinfo/dm-devel
---end quoted text---
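
The alternative mentioned in the quoted text, having the TUR path checker react to 02/04/02 itself, could look roughly like the sketch below. It only shows the decision and the command bytes, not how a checker would actually submit them. The helper names `needs_start_unit` and `build_start_unit_cdb` are hypothetical and are not taken from multipath-tools; the byte offsets assume fixed-format sense data (response code 0x70), and the START STOP UNIT opcode and START bit follow the SCSI block command set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SENSE_KEY_NOT_READY	0x02	/* sense key "Not Ready" */
#define OPCODE_START_STOP_UNIT	0x1B	/* START STOP UNIT, 6-byte CDB */

/*
 * Hypothetical helper: returns true when current, fixed-format sense data
 * reports 02/04/02 (Not Ready, initializing command required).
 */
static bool needs_start_unit(const uint8_t *sense, size_t len)
{
	if (sense == NULL || len < 14)
		return false;
	/* Mask off the VALID bit; 0x70 means current error, fixed format. */
	if ((sense[0] & 0x7f) != 0x70)
		return false;
	return (sense[2] & 0x0f) == SENSE_KEY_NOT_READY &&
	       sense[12] == 0x04 &&	/* additional sense code */
	       sense[13] == 0x02;	/* additional sense code qualifier */
}

/* Hypothetical helper: fills a START STOP UNIT CDB with the START bit set. */
static void build_start_unit_cdb(uint8_t cdb[6])
{
	assert(cdb != NULL);
	memset(cdb, 0, 6);
	cdb[0] = OPCODE_START_STOP_UNIT;
	cdb[4] = 0x01;	/* START = 1 */
}
```

A checker following this idea would call `needs_start_unit()` on the sense buffer returned with the failed TUR and, if it returns true, send the CDB from `build_start_unit_cdb()` before retrying.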
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
w***@linux.vnet.ibm.com
2014-09-30 18:05:47 UTC
Hi Christoph,

We verified the above patch on our test group's system yesterday and
today. It works fine with their test cases.

Thanks,
Wendy
Christoph Hellwig
2014-10-01 12:51:34 UTC
Unfortunately the patch wasn't quite correct - all TEST_UNIT_READY
commands are sent as BLOCK_PC, so this would basically revert James'
original fix for the SATL case.

Am I right to assume you only need the call to scsi_dh->check_sense and
not the rest of the handling for the multipath path checker? If that's
the case, something like the patch below should work:

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 5db8454..399c1c8 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -459,14 +459,6 @@ static int scsi_check_sense(struct scsi_cmnd *scmd)
 	if (! scsi_command_normalize_sense(scmd, &sshdr))
 		return FAILED;	/* no valid sense data */

-	if (scmd->cmnd[0] == TEST_UNIT_READY && scmd->scsi_done != scsi_eh_done)
-		/*
-		 * nasty: for mid-layer issued TURs, we need to return the
-		 * actual sense data without any recovery attempt. For eh
-		 * issued ones, we need to try to recover and interpret
-		 */
-		return SUCCESS;
-
 	scsi_report_sense(sdev, &sshdr);

 	if (scsi_sense_is_deferred(&sshdr))
@@ -482,6 +474,14 @@ static int scsi_check_sense(struct scsi_cmnd *scmd)
 		/* handler does not care. Drop down to default handling */
 	}

+	if (scmd->cmnd[0] == TEST_UNIT_READY && scmd->scsi_done != scsi_eh_done)
+		/*
+		 * nasty: for mid-layer issued TURs, we need to return the
+		 * actual sense data without any recovery attempt. For eh
+		 * issued ones, we need to try to recover and interpret
+		 */
+		return SUCCESS;
+
 	/*
 	 * Previous logic looked for FILEMARK, EOM or ILI which are
 	 * mainly associated with tapes and returned SUCCESS.
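
The effect of moving the TUR check below the device handler call can be illustrated with a small user-space model. This is only a sketch of the control flow in the patch above, not kernel code: the `toy_` types, the return values, and `toy_restart_handler` are all made up for illustration, and the default sense handling is reduced to a single placeholder value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel's return values; not the real definitions. */
enum toy_rc { TOY_NOT_HANDLED, TOY_SUCCESS, TOY_FAILED, TOY_DEFAULT };

struct toy_cmd {
	bool is_tur;		/* models cmnd[0] == TEST_UNIT_READY */
	bool issued_by_eh;	/* models scsi_done == scsi_eh_done */
	unsigned char key, asc, ascq;
	/* models the device handler's check_sense hook; may be NULL */
	enum toy_rc (*dh_check_sense)(const struct toy_cmd *cmd);
};

static bool toy_tur_shortcut(const struct toy_cmd *cmd)
{
	return cmd->is_tur && !cmd->issued_by_eh;
}

static enum toy_rc toy_consult_handler(const struct toy_cmd *cmd)
{
	return cmd->dh_check_sense ? cmd->dh_check_sense(cmd) : TOY_NOT_HANDLED;
}

/* Order before the patch: the TUR check returns before the handler runs. */
static enum toy_rc toy_check_sense_old(const struct toy_cmd *cmd)
{
	assert(cmd != NULL);
	if (toy_tur_shortcut(cmd))
		return TOY_SUCCESS;
	enum toy_rc rc = toy_consult_handler(cmd);
	if (rc != TOY_NOT_HANDLED)
		return rc;
	return TOY_DEFAULT;
}

/* Order after the patch: the handler is consulted first. */
static enum toy_rc toy_check_sense_new(const struct toy_cmd *cmd)
{
	assert(cmd != NULL);
	enum toy_rc rc = toy_consult_handler(cmd);
	if (rc != TOY_NOT_HANDLED)
		return rc;
	if (toy_tur_shortcut(cmd))
		return TOY_SUCCESS;
	return TOY_DEFAULT;
}

/* Hypothetical handler that requests error handling on 02/04/02. */
static enum toy_rc toy_restart_handler(const struct toy_cmd *cmd)
{
	if (cmd->key == 0x02 && cmd->asc == 0x04 && cmd->ascq == 0x02)
		return TOY_FAILED;
	return TOY_NOT_HANDLED;
}
```

In this model a path-checker TUR that fails with 02/04/02 yields `TOY_SUCCESS` under the old order regardless of the handler, while under the new order the handler's answer wins. When the handler does not care about the sense code, both orders still short-circuit the TUR.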
Brian King
2014-10-06 15:22:13 UTC
Post by Christoph Hellwig
Unfortunately the patch wasn't quite correct - all TEST_UNIT_READY
commands are sent as BLOCK_PC, so this would basically revert James'
original fix for the SATL case.
Am I right to assume you only need the call to scsi_dh->check_sense and
not the rest of the handling for the multipath path checker? If that's
This would work if we also duplicated the 02/04/02 K/C/Q check in the
alua_check_sense handler.

Wendy - can you try my patch below, along with Christoph's latest patch here
and see if that resolves the issue?

Thanks,

Brian
Signed-off-by: Brian King <***@linux.vnet.ibm.com>
---

drivers/scsi/device_handler/scsi_dh_alua.c | 7 +++++++
1 file changed, 7 insertions(+)

diff -puN drivers/scsi/device_handler/scsi_dh_alua.c~alua_allow_restart drivers/scsi/device_handler/scsi_dh_alua.c
--- linux/drivers/scsi/device_handler/scsi_dh_alua.c~alua_allow_restart 2014-10-06 10:19:16.184798305 -0500
+++ linux-bjking1/drivers/scsi/device_handler/scsi_dh_alua.c 2014-10-06 10:20:35.743165951 -0500
@@ -474,6 +474,13 @@ static int alua_check_sense(struct scsi_
 			 * LUN Not Ready -- Offline
 			 */
 			return SUCCESS;
+		if (sdev->allow_restart &&
+		    (sense_hdr->asc == 0x04) && (sense_hdr->ascq == 0x02))
+			/*
+			 * if the device is not started, we need to wake
+			 * the error handler to start the motor
+			 */
+			return FAILED;
 		break;
 	case UNIT_ATTENTION:
 		if (sense_hdr->asc == 0x29 && sense_hdr->ascq == 0x00)
_
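
To make the intent of the added branch easier to follow, here is a stand-alone model of just the NOT_READY part of the handler. It is a sketch under assumptions: the `toy_` names are invented, only two of the sense codes from the real function are modelled, and the `with_patch` flag exists purely so the behaviour with and without the new branch can be compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel's return values; not the real definitions. */
enum toy_rc { TOY_NOT_HANDLED, TOY_SUCCESS, TOY_FAILED };

#define TOY_NOT_READY 0x02	/* sense key "Not Ready" */

struct toy_sdev { bool allow_restart; };
struct toy_sense { unsigned char key, asc, ascq; };

/*
 * Models the NOT_READY case of the handler. with_patch selects whether
 * the 04/02 branch from the patch above is present.
 */
static enum toy_rc toy_alua_check_sense(const struct toy_sdev *sdev,
					const struct toy_sense *s,
					bool with_patch)
{
	assert(sdev != NULL && s != NULL);
	if (s->key != TOY_NOT_READY)
		return TOY_NOT_HANDLED;
	if (s->asc == 0x04 && s->ascq == 0x12)
		return TOY_SUCCESS;	/* LUN Not Ready -- Offline */
	if (with_patch && sdev->allow_restart &&
	    s->asc == 0x04 && s->ascq == 0x02)
		return TOY_FAILED;	/* wake the error handler to start the unit */
	return TOY_NOT_HANDLED;		/* fall back to default handling */
}
```

Without the new branch the model reports 02/04/02 as not handled, which suggests why moving the TUR check alone would not be enough: the caller would fall through to the TUR short-circuit and no Start Unit would be triggered.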

w***@linux.vnet.ibm.com
2014-10-06 21:50:32 UTC
Sorry it took some time; we needed to reconfigure the systems for this test.

With Christoph's new patch only, we still saw the failure.
With Christoph's new patch plus Brian's patch, it works fine and we didn't
see the failure.


Thanks,
Wendy
Christoph Hellwig
2014-10-21 11:03:03 UTC
Post by w***@linux.vnet.ibm.com
Sorry it took some time since we need to re-config the systems for this test.
With Christoph's new patch only, still saw the failure.
With Christoph's new patch + Brian's patch, works fine, didn't see the
failure.
Can one of you send me a tested series with both patches?

Thanks!