Discussion:
lk 3.17-rc4 blk_mq large write problems
Douglas Gilbert
2014-09-10 03:55:02 UTC
Permalink
A few days ago I was trying to create a large file
(say 16 GB) of zeros on an ext4 file system:
dd if=/dev/zero bs=64k count=256k of=zero_16g.bin

After about 5 seconds there was a NULL de-reference that
crashed the machine (shown below). This was with a clean
version of lk 3.17-rc4 (from kernel.org) where the target
was a SATA SSD directly connected to a LSI 9300-4i SAS-3
HBA (mpt3sas). Significantly (IMO) the kernel boot line
contained:
scsi_mod.use_blk_mq=Y

In all cases changing that to "N" fixed the problem. I tried
many things, including a SAS SSD but the problem persisted when
use_blk_mq=Y. It doesn't always oops as shown in the first
case below. There were also:
- immediate reboots
- lock-ups without any oops on the console
- different oopses of a somewhat stranger nature
(hard to catch as logging everything on a real
serial port is fiddly) like double bus errors

Rob Elliott has been unable to replicate this problem.

Today I switched to another machine running Debian 7 (the
first machine was Ubuntu 14.04 based); both x86_64.
Built the same kernel on the second machine, this time
with a LSI 9212-4i4e SAS-2 HBA (mpt2sas) and a SAS SSD
directly connected. Roughly speaking it was the same
test case:
# <create 1 partition on say /dev/sdb>
# mkfs.ext4 /dev/sdb1
# mount /dev/sdb1 /mnt/spare
# cd /mnt/spare
# dd if=/dev/zero bs=64k count=256k of=zero_16g.bin
# cd
# umount /mnt/spare

Usually the dd or the umount would crash. Then after a
crash, following a power cycle, the mount would crash.
Changing to scsi_mod.use_blk_mq=N restored sanity.

Tried some other SAS controllers: couldn't get a MR-9240-4i
(MegaRaid) to work at all on my newer box (doesn't like
PCIe 3 ?). Got a ARC-1882I working and it did not have
problems with the big dd (perhaps the arcmsr driver still
uses the host_lock to serialize commands).

So it could be common, bad code in the mpt2sas and mpt3sas
drivers. Or it could be somewhere else. Perhaps there is
more than one problem.

Testers out there are encouraged to run the above test case.
The SATA and SAS SSDs that I used can consume writes in the
300 to 600 MB/sec range.

Part of the strangeness of this first attached oops is that
blk_mq_timeout_check() appears twice. The second one (typically
from the umount) is a blown stack.

Enjoy.
Doug Gilbert
Christoph Hellwig
2014-09-10 15:41:44 UTC
Permalink
While it might not help with a blown stack, can you give the patch below
a try? I tries to solve a problem where the timeout handler hits
before we've fully set up a command. While I'd like to understand the
root cause of why we're hitting it as well, I'd also really to fix that
race. It would also be good to get a gdb listing of the exact area in
scsi_times_out listed in the oops.

---
From: Christoph Hellwig <***@lst.de>
Subject: blk-mq: call blk_mq_start_request from ->queue_rq

When we call blk_mq_start_request from the core blk-mq code before calling into
->queue_rq there is a racy window where the timeout handler can hit before we've
fully set up the driver specific part of the command.

Move the call to blk_mq_start_request into the driver so the driver can start
the request only once it is fully set up.

Signed-off-by: Christoph Hellwig <***@lst.de>

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5189cb1..db9990b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -374,7 +374,7 @@ void blk_mq_complete_request(struct request *rq)
}
EXPORT_SYMBOL(blk_mq_complete_request);

-static void blk_mq_start_request(struct request *rq, bool last)
+void blk_mq_start_request(struct request *rq)
{
struct request_queue *q = rq->q;

@@ -405,29 +405,18 @@ static void blk_mq_start_request(struct request *rq, bool last)
*/
rq->nr_phys_segments++;
}
-
- /*
- * Flag the last request in the series so that drivers know when IO
- * should be kicked off, if they don't do it on a per-request basis.
- *
- * Note: the flag isn't the only condition drivers should do kick off.
- * If drive is busy, the last request might not have the bit set.
- */
- if (last)
- rq->cmd_flags |= REQ_END;
}
+EXPORT_SYMBOL(blk_mq_start_request);

static void __blk_mq_requeue_request(struct request *rq)
{
struct request_queue *q = rq->q;

trace_block_rq_requeue(q, rq);
- clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-
- rq->cmd_flags &= ~REQ_END;
-
- if (q->dma_drain_size && blk_rq_bytes(rq))
- rq->nr_phys_segments--;
+ if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+ if (q->dma_drain_size && blk_rq_bytes(rq))
+ rq->nr_phys_segments--;
+ }
}

void blk_mq_requeue_request(struct request *rq)
@@ -735,9 +724,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
rq = list_first_entry(&rq_list, struct request, queuelist);
list_del_init(&rq->queuelist);

- blk_mq_start_request(rq, list_empty(&rq_list));
-
- ret = q->mq_ops->queue_rq(hctx, rq);
+ ret = q->mq_ops->queue_rq(hctx, rq, list_empty(&rq_list));
switch (ret) {
case BLK_MQ_RQ_QUEUE_OK:
queued++;
@@ -1177,14 +1164,13 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
int ret;

blk_mq_bio_to_request(rq, bio);
- blk_mq_start_request(rq, true);

/*
* For OK queue, we are done. For error, kill it. Any other
* error (busy), just add it to our list as we previously
* would have done
*/
- ret = q->mq_ops->queue_rq(data.hctx, rq);
+ ret = q->mq_ops->queue_rq(data.hctx, rq, true);
if (ret == BLK_MQ_RQ_QUEUE_OK)
goto done;
else {
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index db1e956..9b0127a 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3775,13 +3775,16 @@ static bool mtip_check_unal_depth(struct blk_mq_hw_ctx *hctx,
return false;
}

-static int mtip_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
+static int mtip_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq,
+ bool last)
{
int ret;

if (unlikely(mtip_check_unal_depth(hctx, rq)))
return BLK_MQ_RQ_QUEUE_BUSY;

+ blk_mq_start_request(rq);
+
ret = mtip_submit_request(hctx, rq);
if (likely(!ret))
return BLK_MQ_RQ_QUEUE_OK;
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index a3b042c..d098adfbb 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -313,10 +313,13 @@ static void null_request_fn(struct request_queue *q)
}
}

-static int null_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
+static int null_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq,
+ bool last)
{
struct nullb_cmd *cmd = blk_mq_rq_to_pdu(rq);

+ blk_mq_start_request(rq);
+
cmd->rq = rq;
cmd->nq = hctx->driver_data;

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 0a58140..4b08906 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -164,14 +164,14 @@ static void virtblk_done(struct virtqueue *vq)
spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
}

-static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
+static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req,
+ bool last)
{
struct virtio_blk *vblk = hctx->queue->queuedata;
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
unsigned long flags;
unsigned int num;
int qid = hctx->queue_num;
- const bool last = (req->cmd_flags & REQ_END) != 0;
int err;
bool notify = false;

@@ -213,6 +213,8 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
}

+ blk_mq_start_request(req);
+
spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
err = __virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
if (err) {
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9c44392..dd6e912 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1856,7 +1856,8 @@ static void scsi_mq_done(struct scsi_cmnd *cmd)
blk_mq_complete_request(cmd->request);
}

-static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
+static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req,
+ bool last)
{
struct request_queue *q = req->q;
struct scsi_device *sdev = q->queuedata;
@@ -1890,6 +1891,8 @@ static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
scsi_init_cmd_errh(cmd);
cmd->scsi_done = scsi_mq_done;

+ blk_mq_start_request(req);
+
reason = scsi_dispatch_cmd(cmd);
if (reason) {
scsi_set_blocked(cmd, reason);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index eb726b9..aed92d5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -77,7 +77,7 @@ struct blk_mq_tag_set {
struct list_head tag_list;
};

-typedef int (queue_rq_fn)(struct blk_mq_hw_ctx *, struct request *);
+typedef int (queue_rq_fn)(struct blk_mq_hw_ctx *, struct request *, bool);
typedef struct blk_mq_hw_ctx *(map_queue_fn)(struct request_queue *, const int);
typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int);
typedef void (exit_hctx_fn)(struct blk_mq_hw_ctx *, unsigned int);
@@ -160,6 +160,7 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag);
struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *, const int ctx_index);
struct blk_mq_hw_ctx *blk_mq_alloc_single_hw_queue(struct blk_mq_tag_set *, unsigned int, int);

+void blk_mq_start_request(struct request *rq);
void blk_mq_end_io(struct request *rq, int error);
void __blk_mq_end_io(struct request *rq, int error);

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 66c2167..bb7d664 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -188,7 +188,6 @@ enum rq_flag_bits {
__REQ_MIXED_MERGE, /* merge of different types, fail separately */
__REQ_KERNEL, /* direct IO to kernel pages */
__REQ_PM, /* runtime pm request */
- __REQ_END, /* last of chain of requests */
__REQ_HASHED, /* on IO scheduler merge hash */
__REQ_MQ_INFLIGHT, /* track inflight for MQ */
__REQ_NR_BITS, /* stops here */
@@ -242,7 +241,6 @@ enum rq_flag_bits {
#define REQ_SECURE (1ULL << __REQ_SECURE)
#define REQ_KERNEL (1ULL << __REQ_KERNEL)
#define REQ_PM (1ULL << __REQ_PM)
-#define REQ_END (1ULL << __REQ_END)
#define REQ_HASHED (1ULL << __REQ_HASHED)
#define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT)

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jens Axboe
2014-09-10 16:47:49 UTC
Permalink
Post by Christoph Hellwig
While it might not help with a blown stack, can you give the patch below
a try? I tries to solve a problem where the timeout handler hits
before we've fully set up a command. While I'd like to understand the
root cause of why we're hitting it as well, I'd also really to fix that
race. It would also be good to get a gdb listing of the exact area in
scsi_times_out listed in the oops.
It's a really long window, but it does exist. I'd be curious if the
patch makes a difference for this weird case. I have not seen anything
like it here.

BTW, please don't mix up the REQ_END and ->queue_rq() changes with the
changed start_request API.
--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2014-09-10 18:09:57 UTC
Permalink
Post by Jens Axboe
BTW, please don't mix up the REQ_END and ->queue_rq() changes with the
changed start_request API.
I have to. It's set by start_request, so we need to pass down the last
argument to keep the old behavior. And once we pass the argument we
can just it directly.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jens Axboe
2014-09-10 18:26:57 UTC
Permalink
Post by Christoph Hellwig
Post by Jens Axboe
BTW, please don't mix up the REQ_END and ->queue_rq() changes with the
changed start_request API.
I have to. It's set by start_request, so we need to pass down the last
argument to keep the old behavior. And once we pass the argument we
can just it directly.
It could still be done in the caller, but arguably, you'd have to do it
twice unless the ->queue_rq() call was rolled into a function.
--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2014-09-10 18:40:04 UTC
Permalink
Post by Jens Axboe
Post by Christoph Hellwig
I have to. It's set by start_request, so we need to pass down the last
argument to keep the old behavior. And once we pass the argument we
can just it directly.
It could still be done in the caller, but arguably, you'd have to do it
twice unless the ->queue_rq() call was rolled into a function.
At which point we'd better do it the right way and just pass the flag
instead of wasting a request flag for it. The other benefit is that
there is a clear compile time API break for ->queue_rq that reminds
people that need to start using blk_mq_start_request.

Anyway, still waiting for test reports of the people that can reproduce
the timeouts. By the time I'll submit the patch it should have a much
better changelog.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jens Axboe
2014-09-10 18:42:09 UTC
Permalink
Post by Christoph Hellwig
Post by Jens Axboe
Post by Christoph Hellwig
I have to. It's set by start_request, so we need to pass down the last
argument to keep the old behavior. And once we pass the argument we
can just it directly.
It could still be done in the caller, but arguably, you'd have to do it
twice unless the ->queue_rq() call was rolled into a function.
At which point we'd better do it the right way and just pass the flag
instead of wasting a request flag for it. The other benefit is that
there is a clear compile time API break for ->queue_rq that reminds
people that need to start using blk_mq_start_request.
Yeah that's a good point, hopefully they will look up the commit and
figure out what to do. I'm not adverse to doing these two things at
once, but it should at least have a good mention of it in the changelog.
It's not even mentioned.
Post by Christoph Hellwig
Anyway, still waiting for test reports of the people that can reproduce
the timeouts. By the time I'll submit the patch it should have a much
better changelog.
Thanks, sounds good.
--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Douglas Gilbert
2014-09-11 00:58:51 UTC
Permalink
Post by Christoph Hellwig
While it might not help with a blown stack, can you give the patch below
a try? I tries to solve a problem where the timeout handler hits
before we've fully set up a command. While I'd like to understand the
root cause of why we're hitting it as well, I'd also really to fix that
race. It would also be good to get a gdb listing of the exact area in
scsi_times_out listed in the oops.
RIP: 0010:[<ffffffff8127cd2e>] [<ffffffff8127cd2e>] scsi_times_out+0xe/0x2e0

(gdb) disassemble scsi_times_out
Dump of assembler code for function scsi_times_out:
0xffffffff8127d030 <+0>: push %rbp
0xffffffff8127d031 <+1>: mov $0x2007,%esi
0xffffffff8127d036 <+6>: push %rbx
0xffffffff8127d037 <+7>: mov 0xf8(%rdi),%rbx
0xffffffff8127d03e <+14>: mov (%rbx),%rax
0xffffffff8127d041 <+17>: mov %rbx,%rdi
0xffffffff8127d044 <+20>: mov (%rax),%rbp
0xffffffff8127d047 <+23>: callq 0xffffffff81277c70 <scsi_log_completion>
0xffffffff8127d04c <+28>: cmpl $0xffffffff,0x154(%rbp)
0xffffffff8127d053 <+35>: je 0xffffffff8127d05f <scsi_times_out+47>
...

which seems to agree 'objdump -drS scsi_error.o':

00000000000028b0 <scsi_times_out>:
28b0: 55 push %rbp
28b1: be 07 20 00 00 mov $0x2007,%esi
28b6: 53 push %rbx
28b7: 48 8b 9f f8 00 00 00 mov 0xf8(%rdi),%rbx
28be: 48 8b 03 mov (%rbx),%rax
28c1: 48 89 df mov %rbx,%rdi
28c4: 48 8b 28 mov (%rax),%rbp
28c7: e8 00 00 00 00 callq 28cc <scsi_times_out+0x1c>
28c8: R_X86_64_PC32 scsi_log_completion-0x4
28cc: 83 bd 54 01 00 00 ff cmpl $0xffffffff,0x154(%rbp)
Post by Christoph Hellwig
Subject: blk-mq: call blk_mq_start_request from ->queue_rq
When we call blk_mq_start_request from the core blk-mq code before calling into
->queue_rq there is a racy window where the timeout handler can hit before we've
fully set up the driver specific part of the command.
Move the call to blk_mq_start_request into the driver so the driver can start
the request only once it is fully set up.
Using my original (newer) machine with a SAS SSD, today
I'm seeing only the "blown stack" oops on umount. And on
the next reboot, if use_blk_mq=Y then doing the mount
(on the SAS SSD) causes an instant reboot.

Same with and without this patch. I'll try again with the
SATA SSD (but I need to archive its contents first) and
maybe I can get back to the scsi_times_out oops.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jens Axboe
2014-09-11 02:00:21 UTC
Permalink
Post by Douglas Gilbert
Post by Christoph Hellwig
While it might not help with a blown stack, can you give the patch below
a try? I tries to solve a problem where the timeout handler hits
before we've fully set up a command. While I'd like to understand the
root cause of why we're hitting it as well, I'd also really to fix that
race. It would also be good to get a gdb listing of the exact area in
scsi_times_out listed in the oops.
RIP: 0010:[<ffffffff8127cd2e>] [<ffffffff8127cd2e>]
scsi_times_out+0xe/0x2e0
(gdb) disassemble scsi_times_out
0xffffffff8127d030 <+0>: push %rbp
0xffffffff8127d031 <+1>: mov $0x2007,%esi
0xffffffff8127d036 <+6>: push %rbx
0xffffffff8127d037 <+7>: mov 0xf8(%rdi),%rbx
0xffffffff8127d03e <+14>: mov (%rbx),%rax
0xffffffff8127d041 <+17>: mov %rbx,%rdi
0xffffffff8127d044 <+20>: mov (%rax),%rbp
0xffffffff8127d047 <+23>: callq 0xffffffff81277c70
<scsi_log_completion>
0xffffffff8127d04c <+28>: cmpl $0xffffffff,0x154(%rbp)
0xffffffff8127d053 <+35>: je 0xffffffff8127d05f
<scsi_times_out+47>
...
28b0: 55 push %rbp
28b1: be 07 20 00 00 mov $0x2007,%esi
28b6: 53 push %rbx
28b7: 48 8b 9f f8 00 00 00 mov 0xf8(%rdi),%rbx
28be: 48 8b 03 mov (%rbx),%rax
28c1: 48 89 df mov %rbx,%rdi
28c4: 48 8b 28 mov (%rax),%rbp
28c7: e8 00 00 00 00 callq 28cc <scsi_times_out+0x1c>
28c8: R_X86_64_PC32 scsi_log_completion-0x4
28cc: 83 bd 54 01 00 00 ff cmpl $0xffffffff,0x154(%rbp)
This would be more useful if you had DEBUGINFO enabled. At least it
would save use some time :-)
--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Elliott, Robert (Server Storage)
2014-09-11 03:48:29 UTC
Permalink
-----Original Message-----
Sent: Wednesday, September 10, 2014 9:00 PM
Cc: SCSI development list
Subject: Re: lk 3.17-rc4 blk_mq large write problems
Post by Douglas Gilbert
Post by Christoph Hellwig
While it might not help with a blown stack, can you give the patch
below
Post by Douglas Gilbert
Post by Christoph Hellwig
a try? I tries to solve a problem where the timeout handler hits
before we've fully set up a command. While I'd like to understand
the
Post by Douglas Gilbert
Post by Christoph Hellwig
root cause of why we're hitting it as well, I'd also really to fix
that
Post by Douglas Gilbert
Post by Christoph Hellwig
race. It would also be good to get a gdb listing of the exact area in
scsi_times_out listed in the oops.
RIP: 0010:[<ffffffff8127cd2e>] [<ffffffff8127cd2e>]
scsi_times_out+0xe/0x2e0
(gdb) disassemble scsi_times_out
0xffffffff8127d030 <+0>: push %rbp
0xffffffff8127d031 <+1>: mov $0x2007,%esi
0xffffffff8127d036 <+6>: push %rbx
0xffffffff8127d037 <+7>: mov 0xf8(%rdi),%rbx
0xffffffff8127d03e <+14>: mov (%rbx),%rax
0xffffffff8127d041 <+17>: mov %rbx,%rdi
0xffffffff8127d044 <+20>: mov (%rax),%rbp
0xffffffff8127d047 <+23>: callq 0xffffffff81277c70
<scsi_log_completion>
0xffffffff8127d04c <+28>: cmpl $0xffffffff,0x154(%rbp)
0xffffffff8127d053 <+35>: je 0xffffffff8127d05f
<scsi_times_out+47>
...
28b0: 55 push %rbp
28b1: be 07 20 00 00 mov $0x2007,%esi
28b6: 53 push %rbx
28b7: 48 8b 9f f8 00 00 00 mov 0xf8(%rdi),%rbx
28be: 48 8b 03 mov (%rbx),%rax
28c1: 48 89 df mov %rbx,%rdi
28c4: 48 8b 28 mov (%rax),%rbp
28c7: e8 00 00 00 00 callq 28cc
<scsi_times_out+0x1c>
Post by Douglas Gilbert
28c8: R_X86_64_PC32 scsi_log_completion-0x4
28cc: 83 bd 54 01 00 00 ff cmpl $0xffffffff,0x154(%rbp)
This would be more useful if you had DEBUGINFO enabled. At least it
would save use some time :-)
On my system, that function compiles to:

enum blk_eh_timer_return scsi_times_out(struct request *req)
{
2580: 55 push %rbp
2581: 48 89 e5 mov %rsp,%rbp
2584: 41 55 push %r13
2586: 41 54 push %r12
2588: 53 push %rbx
2589: 48 83 ec 08 sub $0x8,%rsp
258d: e8 00 00 00 00 callq 2592 <scsi_times_out+0x12>
258e: R_X86_64_PC32 mcount-0x4
struct scsi_cmnd *scmd = req->special;
2592: 48 8b 9f f8 00 00 00 mov 0xf8(%rdi),%rbx
enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
struct Scsi_Host *host = scmd->device->host;
2599: 48 8b 03 mov (%rbx),%rax
259c: 4c 8b 20 mov (%rax),%r12
259f: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)

So, Doug's +0xe corresponds to +0x19 here.

258d-2591 and the extra code preceding them are for stack
overflow checking, which I have enabled, and the location
for the optional ftrace jump if tracing this function is
turned on.

2592 is scmd = req->special. %rdi is req, the first and
only function argument (per the x86_64 abi). 0xf8(%rdi)
is referencing req->special. req must be OK, since that's
not the instruction pointer address that failed.

2599 is dereferencing req->special (i.e., scmd) to set host.
req->special must contain NULL, causing the exception
at this IP.

As a short-term workaround, it might be better to just let
this function exit if scmd is NULL, assuming that it was
already dealt with properly. That would still be misled
if the race condition, whatever it may be, allows scmd to
be reused before this function runs; this function would
then operate on a future command.


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Douglas Gilbert
2014-09-11 05:37:03 UTC
Permalink
Post by Elliott, Robert (Server Storage)
-----Original Message-----
Sent: Wednesday, September 10, 2014 9:00 PM
Cc: SCSI development list
Subject: Re: lk 3.17-rc4 blk_mq large write problems
Post by Douglas Gilbert
Post by Christoph Hellwig
While it might not help with a blown stack, can you give the patch
below
Post by Douglas Gilbert
Post by Christoph Hellwig
a try? I tries to solve a problem where the timeout handler hits
before we've fully set up a command. While I'd like to understand
the
Post by Douglas Gilbert
Post by Christoph Hellwig
root cause of why we're hitting it as well, I'd also really to fix
that
Post by Douglas Gilbert
Post by Christoph Hellwig
race. It would also be good to get a gdb listing of the exact area in
scsi_times_out listed in the oops.
RIP: 0010:[<ffffffff8127cd2e>] [<ffffffff8127cd2e>]
scsi_times_out+0xe/0x2e0
(gdb) disassemble scsi_times_out
0xffffffff8127d030 <+0>: push %rbp
0xffffffff8127d031 <+1>: mov $0x2007,%esi
0xffffffff8127d036 <+6>: push %rbx
0xffffffff8127d037 <+7>: mov 0xf8(%rdi),%rbx
0xffffffff8127d03e <+14>: mov (%rbx),%rax
0xffffffff8127d041 <+17>: mov %rbx,%rdi
0xffffffff8127d044 <+20>: mov (%rax),%rbp
0xffffffff8127d047 <+23>: callq 0xffffffff81277c70
<scsi_log_completion>
0xffffffff8127d04c <+28>: cmpl $0xffffffff,0x154(%rbp)
0xffffffff8127d053 <+35>: je 0xffffffff8127d05f
<scsi_times_out+47>
...
28b0: 55 push %rbp
28b1: be 07 20 00 00 mov $0x2007,%esi
28b6: 53 push %rbx
28b7: 48 8b 9f f8 00 00 00 mov 0xf8(%rdi),%rbx
28be: 48 8b 03 mov (%rbx),%rax
28c1: 48 89 df mov %rbx,%rdi
28c4: 48 8b 28 mov (%rax),%rbp
28c7: e8 00 00 00 00 callq 28cc
<scsi_times_out+0x1c>
Post by Douglas Gilbert
28c8: R_X86_64_PC32 scsi_log_completion-0x4
28cc: 83 bd 54 01 00 00 ff cmpl $0xffffffff,0x154(%rbp)
This would be more useful if you had DEBUGINFO enabled. At least it
would save use some time :-)
enum blk_eh_timer_return scsi_times_out(struct request *req)
{
2580: 55 push %rbp
2581: 48 89 e5 mov %rsp,%rbp
2584: 41 55 push %r13
2586: 41 54 push %r12
2588: 53 push %rbx
2589: 48 83 ec 08 sub $0x8,%rsp
258d: e8 00 00 00 00 callq 2592 <scsi_times_out+0x12>
258e: R_X86_64_PC32 mcount-0x4
struct scsi_cmnd *scmd = req->special;
2592: 48 8b 9f f8 00 00 00 mov 0xf8(%rdi),%rbx
enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
struct Scsi_Host *host = scmd->device->host;
2599: 48 8b 03 mov (%rbx),%rax
259c: 4c 8b 20 mov (%rax),%r12
259f: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
So, Doug's +0xe corresponds to +0x19 here.
258d-2591 and the extra code preceding them are for stack
overflow checking, which I have enabled, and the location
for the optional ftrace jump if tracing this function is
turned on.
2592 is scmd = req->special. %rdi is req, the first and
only function argument (per the x86_64 abi). 0xf8(%rdi)
is referencing req->special. req must be OK, since that's
not the instruction pointer address that failed.
2599 is dereferencing req->special (i.e., scmd) to set host.
req->special must contain NULL, causing the exception
at this IP.
As a short-term workaround, it might be better to just let
this function exit if scmd is NULL, assuming that it was
already dealt with properly. That would still be misled
if the race condition, whatever it may be, allows scmd to
be reused before this function runs; this function would
then operate on a future command.
Switching back to a SATA SSD gives me the same
"scsi_times_out+0xe" NULL de-reference oops shown at the
start of this thread.

(gdb) list *(scsi_times_out+0xe)
0xffffffff812804fe is in scsi_times_out (drivers/scsi/scsi_error.c:277).
272 */
273 enum blk_eh_timer_return scsi_times_out(struct request *req)
274 {
275 struct scsi_cmnd *scmd = req->special;
276 enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
277 struct Scsi_Host *host = scmd->device->host;
278
279 trace_scsi_dispatch_cmd_timeout(scmd);
280 scsi_log_completion(scmd, TIMEOUT_ERROR);

As Rob suggested.


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2014-09-17 17:04:38 UTC
Permalink
Doug,

can you test the latest for-linus branch in Jens' tree
(git://git.kernel.dk/linux-block). It has a fix for the infinite
recursion with the FUA flag, and a lot of timeout handling fixes.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Douglas Gilbert
2014-09-22 23:14:41 UTC
Permalink
Post by Douglas Gilbert
A few days ago I was trying to create a large file
dd if=/dev/zero bs=64k count=256k of=zero_16g.bin
After about 5 seconds there was a NULL de-reference that
crashed the machine (shown below). This was with a clean
version of lk 3.17-rc4 (from kernel.org) where the target
was a SATA SSD directly connected to a LSI 9300-4i SAS-3
HBA (mpt3sas). Significantly (IMO) the kernel boot line
scsi_mod.use_blk_mq=Y
In all cases changing that to "N" fixed the problem. I tried
many things, including a SAS SSD but the problem persisted when
use_blk_mq=Y. It doesn't always oops as shown in the first
- immediate reboots
- lock-ups without any oops on the console
- different oopses of a somewhat stranger nature
(hard to catch as logging everything on a real
serial port is fiddly) like double bus errors
Rob Elliott has been unable to replicate this problem.
Today I switched to another machine running Debian 7 (the
first machine was Ubuntu 14.04 based); both x86_64.
Built the same kernel on the second machine, this time
with a LSI 9212-4i4e SAS-2 HBA (mpt2sas) and a SAS SSD
directly connected. Roughly speaking it was the same
# <create 1 partition on say /dev/sdb>
# mkfs.ext4 /dev/sdb1
# mount /dev/sdb1 /mnt/spare
# cd /mnt/spare
# dd if=/dev/zero bs=64k count=256k of=zero_16g.bin
# cd
# umount /mnt/spare
Usually the dd or the umount would crash. Then after a
crash, following a power cycle, the mount would crash.
Changing to scsi_mod.use_blk_mq=N restored sanity.
Tried some other SAS controllers: couldn't get a MR-9240-4i
(MegaRaid) to work at all on my newer box (doesn't like
PCIe 3 ?). Got a ARC-1882I working and it did not have
problems with the big dd (perhaps the arcmsr driver still
uses the host_lock to serialize commands).
So it could be common, bad code in the mpt2sas and mpt3sas
drivers. Or it could be somewhere else. Perhaps there is
more than one problem.
Testers out there are encouraged to run the above test case.
The SATA and SAS SSDs that I used can consume writes in the
300 to 600 MB/sec range.
Part of the strangeness of this first attached oops is that
blk_mq_timeout_check() appears twice. The second one (typically
from the umount) is a blown stack.
Using the block/for-linus tree that I built today,
the freeze-during-boot-up problem has gone away as
reported earlier.

That allows me to retest the problem reported in this
thread with the same disk (INTEL SSDSA2M080) and the
same configuration. Just did four cycles of the test
sequence shown above plus a shutdown. No problems seen.

Doug Gilbert

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Loading...