Discussion:
boot stall regression due to blk-mq: use percpu_ref for mq usage count
Christoph Hellwig
2014-09-19 11:38:15 UTC
Permalink
Hi Jens, hi Tejun,

I've seen multi-second boot stalls in one of my KVM setups during
the initial scsi scan:

[ 0.949892] scsi host0: Virtio SCSI HBA
[ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz

<stall>

[ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
[ 16.194099] osd: LOADED open-osd 0.2.1
[ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.208478] sd 0:0:0:0: [sda] Write Protect is off
[ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off
[ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA



I've tracked this down to "blk-mq: use percpu_ref for mq usage count" in
a rather painful way as that one introduced enough other regressions
to mess up bisect.

If I revert the following commits:

dd840087086f3b93ac20f7472b4fca59aff7b79f
cddd5d17642cc6881352732693c2ae6930e9ce65
add703fda981b9719d37f371498b9f129acbd997

which are the above mentioned commit and two fixes to it the problem goes
away.

My qemu command line is below:

kvm \
-m 2048 \
-smp 1 \
-kernel arch/x86/boot/bzImage \
-append "root=/dev/vda console=tty0 console=ttyS0,115200n8 scsi_mod.use_blk_mq=Y" \
-nographic \
-drive if=virtio,file=/work/images/debian.qcow2,cache=none,serial="test1234" \
-drive if=none,id=test,file=/work/images/test.img,cache=none,aio=native \
-drive if=none,id=scratch,file=/work/images/scratch.img,cache=none,aio=native \
-device virtio-scsi-pci,id=scsi \
-device scsi-hd,drive=test \
-device scsi-hd,drive=scratch \
-drive file=/work/images/debian-7.3.0-amd64-netinst.iso,index=2,media=cdrom
Jens Axboe
2014-09-19 19:13:11 UTC
Permalink
Post by Christoph Hellwig
Hi Jens, hi Tejun,
I've seen multi-second boot stalls in one of my KVM setups during
[ 0.949892] scsi host0: Virtio SCSI HBA
[ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz
<stall>
[ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
[ 16.194099] osd: LOADED open-osd 0.2.1
[ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.208478] sd 0:0:0:0: [sda] Write Protect is off
[ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off
[ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
I've tracked this down to "blk-mq: use percpu_ref for mq usage count" in
a rather painful way as that one introduced enough other regressions
to mess up bisect.
dd840087086f3b93ac20f7472b4fca59aff7b79f
cddd5d17642cc6881352732693c2ae6930e9ce65
add703fda981b9719d37f371498b9f129acbd997
which are the above mentioned commit and two fixes to it the problem goes
away.
Thanks for bisecting this, I ran into something that I think is the same
issue about a week ago. Tejun, any ideas before I dig into this one?
--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2014-09-23 05:55:54 UTC
Permalink
Jens,

can we simply get these commits reverted from now if there's no better
fix? I'd hate to have this boot stall in the first kernel with blk-mq
support for scsi.
Tejun Heo
2014-09-23 05:56:48 UTC
Permalink
Post by Christoph Hellwig
Jens,
can we simply get these commits reverted from now if there's no better
fix? I'd hate to have this boot stall in the first kernel with blk-mq
support for scsi.
Patches going out right now.

Thanks.
--
tejun
Tejun Heo
2014-09-23 05:57:24 UTC
Permalink
Post by Tejun Heo
Post by Christoph Hellwig
Jens,
can we simply get these commits reverted from now if there's no better
fix? I'd hate to have this boot stall in the first kernel with blk-mq
support for scsi.
Patches going out right now.
And the original implementation was broken, so...
--
tejun
Christoph Hellwig
2014-09-23 05:59:24 UTC
Permalink
Post by Tejun Heo
Post by Christoph Hellwig
Jens,
can we simply get these commits reverted from now if there's no better
fix? I'd hate to have this boot stall in the first kernel with blk-mq
support for scsi.
Patches going out right now.
"[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"

looks way to big for 3.17, and the regression was introduced in the 3.17
merge window. I'm not sure what was broken before, but it defintively
survived a lot of testing.
Tejun Heo
2014-09-23 06:01:41 UTC
Permalink
Post by Christoph Hellwig
"[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"
looks way to big for 3.17, and the regression was introduced in the 3.17
merge window. I'm not sure what was broken before, but it defintively
survived a lot of testing.
Do we even care about fixing it for 3.17? scsi-mq isn't enabled by
default even for 3.18. The open-coded percpu ref thing was subtly
broken there. It'd be difficult to trigger but I'm fairly sure it'd
crap out in the wild once in a blue moon.

Thanks.
--
tejun
Tejun Heo
2014-09-23 06:03:46 UTC
Permalink
Post by Tejun Heo
Post by Christoph Hellwig
"[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"
looks way to big for 3.17, and the regression was introduced in the 3.17
merge window. I'm not sure what was broken before, but it defintively
survived a lot of testing.
Do we even care about fixing it for 3.17? scsi-mq isn't enabled by
default even for 3.18. The open-coded percpu ref thing was subtly
broken there. It'd be difficult to trigger but I'm fairly sure it'd
crap out in the wild once in a blue moon.
Hmmm... also, this is actually an issue with scsi that block layer is
working around. For other drivers (virtio), the latency during
shutdown should be completely fine.
--
tejun
Christoph Hellwig
2014-09-23 06:09:06 UTC
Permalink
Post by Tejun Heo
Post by Christoph Hellwig
"[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"
looks way to big for 3.17, and the regression was introduced in the 3.17
merge window. I'm not sure what was broken before, but it defintively
survived a lot of testing.
Do we even care about fixing it for 3.17? scsi-mq isn't enabled by
default even for 3.18. The open-coded percpu ref thing was subtly
broken there. It'd be difficult to trigger but I'm fairly sure it'd
crap out in the wild once in a blue moon.
It's compiled in by default, and people are extremly eager to test it.
Tejun Heo
2014-09-23 06:11:52 UTC
Permalink
Post by Christoph Hellwig
Post by Tejun Heo
Post by Christoph Hellwig
"[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"
looks way to big for 3.17, and the regression was introduced in the 3.17
merge window. I'm not sure what was broken before, but it defintively
survived a lot of testing.
Do we even care about fixing it for 3.17? scsi-mq isn't enabled by
default even for 3.18. The open-coded percpu ref thing was subtly
broken there. It'd be difficult to trigger but I'm fairly sure it'd
crap out in the wild once in a blue moon.
It's compiled in by default, and people are extremly eager to test it.
Ugh, I don't know. It's not like we have a very good baseline we can
go back to and reverting it for -stable and then redoing it seems
kinda excessive for a yet experimental feature. Jens?
--
tejun
Elliott, Robert (Server Storage)
2014-09-23 14:57:48 UTC
Permalink
-----Original Message-----
Sent: Tuesday, 23 September, 2014 1:12 AM
To: Christoph Hellwig
Subject: Re: boot stall regression due to blk-mq: use percpu_ref for mq usage
count
Post by Christoph Hellwig
Post by Tejun Heo
Post by Christoph Hellwig
"[PATCHSET percpu/for-3.18] percpu_ref: implement
switch_to_atomic/percpu()"
Post by Christoph Hellwig
Post by Tejun Heo
Post by Christoph Hellwig
looks way to big for 3.17, and the regression was introduced in the
3.17
Post by Christoph Hellwig
Post by Tejun Heo
Post by Christoph Hellwig
merge window. I'm not sure what was broken before, but it defintively
survived a lot of testing.
Do we even care about fixing it for 3.17? scsi-mq isn't enabled by
default even for 3.18. The open-coded percpu ref thing was subtly
broken there. It'd be difficult to trigger but I'm fairly sure it'd
crap out in the wild once in a blue moon.
It's compiled in by default, and people are extremly eager to test it.
Ugh, I don't know. It's not like we have a very good baseline we can
go back to and reverting it for -stable and then redoing it seems
kinda excessive for a yet experimental feature. Jens?
scsi_mod.use_blk_mq is not listed in 3.17rc6 Documentation/kernel-
parameters.txt or Documentation/scsi/scsi-parameters.txt (which
does admit "This document may not be entirely up to date and
comprehensive.").

Perhaps a description with an "experimental" warning could be
slipped into scsi-parameters.txt in 3.17, with plans to remove
that warning in 3.18.

---
Rob Elliott HP Server Storage



--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jens Axboe
2014-09-23 15:49:31 UTC
Permalink
Post by Tejun Heo
Post by Christoph Hellwig
Post by Tejun Heo
Post by Christoph Hellwig
"[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"
looks way to big for 3.17, and the regression was introduced in the 3.17
merge window. I'm not sure what was broken before, but it defintively
survived a lot of testing.
Do we even care about fixing it for 3.17? scsi-mq isn't enabled by
default even for 3.18. The open-coded percpu ref thing was subtly
broken there. It'd be difficult to trigger but I'm fairly sure it'd
crap out in the wild once in a blue moon.
It's compiled in by default, and people are extremly eager to test it.
Ugh, I don't know. It's not like we have a very good baseline we can
go back to and reverting it for -stable and then redoing it seems
kinda excessive for a yet experimental feature. Jens?
It's not just scsi-mq, there are active users of blk-mq in the current
tree - like virtio_blk, mtip32xx. None of those are affected by the RCU
slowdown due to these changes, so it's not a big deal to them. But it is
a big deal if we can't tell people to test scsi-mq in 3.17, that was the
entire point of having it there but not default to on. So yeah, this
really should be fixed for 3.17.

I'm not aware of any reports on the existing enter count breaking things
for them. So while it may not be perfect, reverting the percpu ref count
changes for 3.17 may be the best option that we have.
--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Tejun Heo
2014-09-23 19:24:32 UTC
Permalink
blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measureable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.

Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs. This means that SCSI probing may take more than ten seconds
when scsi-mq is used.

This will be properly fixed by implementing a mechanism to keep
q->mq_usage_counter in atomic mode till genhd registration; however,
that involves rather big updates to percpu_ref which is difficult to
apply late in the devel cycle (v3.17-rc6 at the moment). As a
stop-gap measure till the proper fix can be implemented in the next
cycle, this patch introduces __percpu_ref_kill_expedited() and makes
blk_mq_freeze_queue() use it. This is heavy-handed but should work
for testing the experimental SCSI blk-mq implementation.

Signed-off-by: Tejun Heo <***@kernel.org>
Reported-by: Christoph Hellwig <***@infradead.org>
Link: http://lkml.kernel.org/g/***@lst.de
Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
Cc: Kent Overstreet <***@daterainc.com>
Cc: Jens Axboe <***@kernel.dk>
---
Hello, Jens, Christoph.

How about this one? This is kinda ugly but should work fine in most
cases and easy to apply to v3.17 and take out during v3.18.

Thanks.

block/blk-mq.c | 11 ++++++++++-
include/linux/percpu-refcount.h | 1 +
lib/percpu-refcount.c | 16 ++++++++++++++++
3 files changed, 27 insertions(+), 1 deletion(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -120,7 +120,16 @@ void blk_mq_freeze_queue(struct request_
spin_unlock_irq(q->queue_lock);

if (freeze) {
- percpu_ref_kill(&q->mq_usage_counter);
+ /*
+ * XXX: Temporary kludge to work around SCSI blk-mq stall.
+ * SCSI synchronously creates and destroys many queues
+ * back-to-back during probe leading to lengthy stalls.
+ * This will be fixed by keeping ->mq_usage_counter in
+ * atomic mode until genhd registration, but, for now,
+ * let's work around using expedited synchronization.
+ */
+ __percpu_ref_kill_expedited(&q->mq_usage_counter);
+
blk_mq_run_queues(q, false);
}
wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -72,6 +72,7 @@ void percpu_ref_reinit(struct percpu_ref
void percpu_ref_exit(struct percpu_ref *ref);
void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
percpu_ref_func_t *confirm_kill);
+void __percpu_ref_kill_expedited(struct percpu_ref *ref);

/**
* percpu_ref_kill - drop the initial ref
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -189,3 +189,19 @@ void percpu_ref_kill_and_confirm(struct
call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
}
EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
+
+/*
+ * XXX: Temporary kludge to work around SCSI blk-mq stall. Used only by
+ * block/blk-mq.c::blk_mq_freeze_queue(). Will be removed during v3.18
+ * devel cycle. Do not use anywhere else.
+ */
+void __percpu_ref_kill_expedited(struct percpu_ref *ref)
+{
+ WARN_ONCE(ref->pcpu_count_ptr & PCPU_REF_DEAD,
+ "percpu_ref_kill() called more than once on %pf!",
+ ref->release);
+
+ ref->pcpu_count_ptr |= PCPU_REF_DEAD;
+ synchronize_sched_expedited();
+ percpu_ref_kill_rcu(&ref->rcu);
+}
Jens Axboe
2014-09-23 19:29:03 UTC
Permalink
Post by Tejun Heo
blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measureable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.
Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs. This means that SCSI probing may take more than ten seconds
when scsi-mq is used.
This will be properly fixed by implementing a mechanism to keep
q->mq_usage_counter in atomic mode till genhd registration; however,
that involves rather big updates to percpu_ref which is difficult to
apply late in the devel cycle (v3.17-rc6 at the moment). As a
stop-gap measure till the proper fix can be implemented in the next
cycle, this patch introduces __percpu_ref_kill_expedited() and makes
blk_mq_freeze_queue() use it. This is heavy-handed but should work
for testing the experimental SCSI blk-mq implementation.
Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
---
Hello, Jens, Christoph.
How about this one? This is kinda ugly but should work fine in most
cases and easy to apply to v3.17 and take out during v3.18.
The hack looks fine to me, if it passes Christophs boot stall test. The
hack isn't THAT nasty, since you already have a series queued up to fix
it for real for 3.18.
--
Jens Axboe
Tejun Heo
2014-09-23 19:48:34 UTC
Permalink
Oops, I forgot to cc Kent. Kent, the patch is at
Post by Jens Axboe
The hack looks fine to me, if it passes Christophs boot stall test. The
hack isn't THAT nasty, since you already have a series queued up to fix
it for real for 3.18.
Yeah, then I can pull in block fixes branch into a percpu for-3.18
branch, revert the hack and then apply the patches to implement the
proper fix.

Thanks.
--
tejun
Christoph Hellwig
2014-09-24 08:23:48 UTC
Permalink
Thanks, this fixes the boot stall for me.

Tested-by: Christoph Hellwig <***@lst.de>
Jens Axboe
2014-09-24 14:30:33 UTC
Permalink
Post by Christoph Hellwig
Thanks, this fixes the boot stall for me.
Sweet, I'll shove this off today, so we are done for 3.17 release.
--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Tejun Heo
2014-09-24 14:33:20 UTC
Permalink
Post by Jens Axboe
Post by Christoph Hellwig
Thanks, this fixes the boot stall for me.
Sweet, I'll shove this off today, so we are done for 3.17 release.
Cool, once it hits the block branch, I'll pull it into percpu/for-3.18
and apply the patches for proper fix on top.

Thanks.
--
tejun
Jens Axboe
2014-09-24 14:33:58 UTC
Permalink
Post by Tejun Heo
Post by Jens Axboe
Post by Christoph Hellwig
Thanks, this fixes the boot stall for me.
Sweet, I'll shove this off today, so we are done for 3.17 release.
Cool, once it hits the block branch, I'll pull it into percpu/for-3.18
and apply the patches for proper fix on top.
It's in for-linus now.
--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Tejun Heo
2014-09-24 17:20:51 UTC
Permalink
From 9eca80461a45177e456219a9cd944c27675d6512 Mon Sep 17 00:00:00 2001
From: Tejun Heo <***@kernel.org>
Date: Wed, 24 Sep 2014 13:07:33 -0400

This reverts commit 0a30288da1aec914e158c2d7a3482a85f632750f, which
was a temporary fix for SCSI blk-mq stall issue. The following
patches will fix the issue properly by introducing atomic mode to
percpu_ref.

Signed-off-by: Tejun Heo <***@kernel.org>
Cc: Kent Overstreet <***@daterainc.com>
Cc: Jens Axboe <***@kernel.dk>
Cc: Christoph Hellwig <***@lst.de>
---
Hello,

I pulled in block/for-linus into percpu/for-3.18 and reverted the temp
fix. Will apply the patchset to implement the proper fix on top of
it.

Thanks.

block/blk-mq.c | 11 +----------
include/linux/percpu-refcount.h | 1 -
lib/percpu-refcount.c | 16 ----------------
3 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 255d79c..44a78ae 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -119,16 +119,7 @@ void blk_mq_freeze_queue(struct request_queue *q)
spin_unlock_irq(q->queue_lock);

if (freeze) {
- /*
- * XXX: Temporary kludge to work around SCSI blk-mq stall.
- * SCSI synchronously creates and destroys many queues
- * back-to-back during probe leading to lengthy stalls.
- * This will be fixed by keeping ->mq_usage_counter in
- * atomic mode until genhd registration, but, for now,
- * let's work around using expedited synchronization.
- */
- __percpu_ref_kill_expedited(&q->mq_usage_counter);
-
+ percpu_ref_kill(&q->mq_usage_counter);
blk_mq_run_queues(q, false);
}
wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 11b38ce..5df6784 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -72,7 +72,6 @@ void percpu_ref_reinit(struct percpu_ref *ref);
void percpu_ref_exit(struct percpu_ref *ref);
void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
percpu_ref_func_t *confirm_kill);
-void __percpu_ref_kill_expedited(struct percpu_ref *ref);

/**
* percpu_ref_kill - drop the initial ref
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index c6c31e2..559ee0b 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -189,19 +189,3 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
}
EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
-
-/*
- * XXX: Temporary kludge to work around SCSI blk-mq stall. Used only by
- * block/blk-mq.c::blk_mq_freeze_queue(). Will be removed during v3.18
- * devel cycle. Do not use anywhere else.
- */
-void __percpu_ref_kill_expedited(struct percpu_ref *ref)
-{
- WARN_ONCE(ref->pcpu_count_ptr & PCPU_REF_DEAD,
- "percpu_ref_kill() called more than once on %pf!",
- ref->release);
-
- ref->pcpu_count_ptr |= PCPU_REF_DEAD;
- synchronize_sched_expedited();
- percpu_ref_kill_rcu(&ref->rcu);
-}
--
1.9.3
Tejun Heo
2014-09-23 06:08:12 UTC
Permalink
From 83b06f4fc6ca2f7f3d706a168b71c248bdada668 Mon Sep 17 00:00:00 2001
From: Tejun Heo <***@kernel.org>
Date: Tue, 23 Sep 2014 01:58:34 -0400

blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measureable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.

Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs. This means that SCSI probing may take above ten seconds when
scsi-mq is used.

[ 0.949892] scsi host0: Virtio SCSI HBA
[ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz

<stall>

[ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
[ 16.194099] osd: LOADED open-osd 0.2.1
[ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.208478] sd 0:0:0:0: [sda] Write Protect is off
[ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off
[ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

This is also the reason why request_queues start in bypass mode which
is ended on blk_register_queue() as shutting down a fully functional
queue also involves a RCU grace period and the queues for non-existent
SCSI devices never reach registration.

blk-mq basically needs to do the same thing - start the mq in a
degraded mode which is faster to shut down and then make it fully
functional only after the queue reaches registration. percpu_ref
recently grew facilities to force atomic operation until explicitly
switched to percpu mode, which can be used for this purpose. This
patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
switch it to percpu mode only once blk_register_queue() is reached.

Signed-off-by: Tejun Heo <***@kernel.org>
Reported-by: Christoph Hellwig <***@infradead.org>
Link: http://lkml.kernel.org/g/***@lst.de
Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
Cc: Kent Overstreet <***@daterainc.com>
Cc: Jens Axboe <***@kernel.dk>
Cc: Johannes Weiner <***@cmpxchg.org>
---
Hello,

This patch is on top of

block/for-3.18/core fe052529e465 ("scsi: move blk_mq_start_request call earlier")
+ [PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()
http://lkml.kernel.org/g/1411451718-17807-1-git-send-email-***@kernel.org

and available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git review-blk_mq-latency-fix

Thanks.

block/blk-mq-sysfs.c | 6 ++++++
block/blk-mq.c | 6 +++++-
block/blk-sysfs.c | 11 +++++++++--
include/linux/blk-mq.h | 1 +
4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index ed52178..371d880 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -402,6 +402,12 @@ static void blk_mq_sysfs_init(struct request_queue *q)
}
}

+/* see blk_register_queue() */
+void blk_mq_finish_init(struct request_queue *q)
+{
+ percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+}
+
int blk_mq_register_disk(struct gendisk *disk)
{
struct device *dev = disk_to_dev(disk);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 91c49b3..4ae58b1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1783,8 +1783,12 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
if (!q)
goto err_hctxs;

+ /*
+ * Init percpu_ref in atomic mode so that it's faster to shutdown.
+ * See blk_register_queue() for details.
+ */
if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release,
- 0, GFP_KERNEL))
+ PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
goto err_map;

setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 17f5c84..521ae90 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -551,12 +551,19 @@ int blk_register_queue(struct gendisk *disk)
return -ENXIO;

/*
- * Initialization must be complete by now. Finish the initial
- * bypass from queue allocation.
+ * SCSI probing may synchronously create and destroy a lot of
+ * request_queues for non-existent devices. Shutting down a fully
+ * functional queue takes measureable wallclock time as RCU grace
+ * periods are involved. To avoid excessive latency in these
+ * cases, a request_queue starts out in a degraded mode which is
+ * faster to shut down and is made fully functional here as
+ * request_queues for non-existent devices never get registered.
*/
if (!blk_queue_init_done(q)) {
queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
blk_queue_bypass_end(q);
+ if (q->mq_ops)
+ blk_mq_finish_init(q);
}

ret = blk_trace_init_sysfs(dev);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3253495..4495dd4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -144,6 +144,7 @@ enum {
};

struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
+void blk_mq_finish_init(struct request_queue *q);
int blk_mq_register_disk(struct gendisk *);
void blk_mq_unregister_disk(struct gendisk *);
--
1.9.3
--
tejun
Tejun Heo
2014-09-24 17:33:52 UTC
Permalink
From 83b06f4fc6ca2f7f3d706a168b71c248bdada668 Mon Sep 17 00:00:00 2001
Date: Tue, 23 Sep 2014 01:58:34 -0400
blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measureable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.
Applied to percpu/for-3.18 on top of the klude, its revert and the
required percpu_ref changes.

Thanks.
--
tejun
Tejun Heo
2014-09-24 17:43:02 UTC
Permalink
FYI, I added a paragraph explaining what happened to the temp fix at
the end. The following is the applied version.

Thanks.
----- 8< -----
From 17497acbdce9506fd6a75115dee4ab80c3cc5ee5 Mon Sep 17 00:00:00 2001
From: Tejun Heo <***@kernel.org>
Date: Wed, 24 Sep 2014 13:31:50 -0400

blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measureable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.

Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs. This means that SCSI probing may take above ten seconds when
scsi-mq is used.

[ 0.949892] scsi host0: Virtio SCSI HBA
[ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz

<stall>

[ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
[ 16.194099] osd: LOADED open-osd 0.2.1
[ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.208478] sd 0:0:0:0: [sda] Write Protect is off
[ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off
[ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

This is also the reason why request_queues start in bypass mode which
is ended on blk_register_queue() as shutting down a fully functional
queue also involves a RCU grace period and the queues for non-existent
SCSI devices never reach registration.

blk-mq basically needs to do the same thing - start the mq in a
degraded mode which is faster to shut down and then make it fully
functional only after the queue reaches registration. percpu_ref
recently grew facilities to force atomic operation until explicitly
switched to percpu mode, which can be used for this purpose. This
patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
switch it to percpu mode only once blk_register_queue() is reached.

Note that this issue was previously worked around by 0a30288da1ae
("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during
probe") for v3.17. The temp fix was reverted in preparation of adding
persistent atomic mode to percpu_ref by 9eca80461a45 ("Revert "blk-mq,
percpu_ref: implement a kludge for SCSI blk-mq stall during probe"").
This patch and the prerequisite percpu_ref changes will be merged
during v3.18 devel cycle.

Signed-off-by: Tejun Heo <***@kernel.org>
Reported-by: Christoph Hellwig <***@infradead.org>
Link: http://lkml.kernel.org/g/***@lst.de
Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
Reviewed-by: Kent Overstreet <***@daterainc.com>
Cc: Jens Axboe <***@kernel.dk>
Cc: Johannes Weiner <***@cmpxchg.org>
---
block/blk-mq-sysfs.c | 6 ++++++
block/blk-mq.c | 6 +++++-
block/blk-sysfs.c | 11 +++++++++--
include/linux/blk-mq.h | 1 +
4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index ed52178..371d880 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -402,6 +402,12 @@ static void blk_mq_sysfs_init(struct request_queue *q)
}
}

+/* see blk_register_queue() */
+void blk_mq_finish_init(struct request_queue *q)
+{
+ percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+}
+
int blk_mq_register_disk(struct gendisk *disk)
{
struct device *dev = disk_to_dev(disk);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d85fe01..38f4a16 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1795,8 +1795,12 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
if (!q)
goto err_hctxs;

+ /*
+ * Init percpu_ref in atomic mode so that it's faster to shutdown.
+ * See blk_register_queue() for details.
+ */
if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release,
- 0, GFP_KERNEL))
+ PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
goto err_map;

setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 17f5c84..521ae90 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -551,12 +551,19 @@ int blk_register_queue(struct gendisk *disk)
return -ENXIO;

/*
- * Initialization must be complete by now. Finish the initial
- * bypass from queue allocation.
+ * SCSI probing may synchronously create and destroy a lot of
+ * request_queues for non-existent devices. Shutting down a fully
+ * functional queue takes measureable wallclock time as RCU grace
+ * periods are involved. To avoid excessive latency in these
+ * cases, a request_queue starts out in a degraded mode which is
+ * faster to shut down and is made fully functional here as
+ * request_queues for non-existent devices never get registered.
*/
if (!blk_queue_init_done(q)) {
queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
blk_queue_bypass_end(q);
+ if (q->mq_ops)
+ blk_mq_finish_init(q);
}

ret = blk_trace_init_sysfs(dev);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a1e31f2..c13a0c0 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,7 @@ enum {
};

struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
+void blk_mq_finish_init(struct request_queue *q);
int blk_mq_register_disk(struct gendisk *);
void blk_mq_unregister_disk(struct gendisk *);
--
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Loading...