Concurrent SG_SCSI_RESET ioctls

Discussion:

Elliott, Robert (Server Storage)

2014-10-10 23:32:26 UTC

The sg3_utils command
sg_reset --device /dev/sda

invokes an ioctl with SG_SCSI_RESET, an argument of
SG_SCSI_RESET_DEVICE, on a device opened with O_NDELAY.

The call chain is like this:

sd_ioctl [sd.c]
scsi_nonblockable_ioctl [scsi_ioctl.c]
if (ndelay) { [this is true]
if (scsi_host_in_recovery(sdev->host))
return -ENODEV;
}
scsi_reset_provider [scsi_error.c]
shost->tmf_in_progress = 1;
scsi_try_bus_device_reset [scsi_error.c]
hostt->eh_device_reset_handler [LLD]
shost->tmf_in_progress = 1;
if that returns 0, return; otherwise try:
scsi_cmd_blk_ioctl
scsi_ioctl
sdev->host->hostt->ioctl [LLD]

where scsi_host_in_recovery is:

static inline int scsi_host_in_recovery(struct Scsi_Host *shost)
{
return shost->shost_state == SHOST_RECOVERY ||
shost->shost_state == SHOST_CANCEL_RECOVERY ||
shost->shost_state == SHOST_DEL_RECOVERY ||
shost->tmf_in_progress;
}

Problem
=======
If you run sg_reset --device concurrently to multiple
devices on the same host, then some of them will run
into tmf_in_progress and have scsi_nonblockable_ioctl
return -ENODEV. This causes sd_ioctl to send the ioctl
request to the LLD's ioctl function, where it gets
rejected as unsupported with -ENOTTY. sg_reset ends
up displaying:
sg_reset: SG_SCSI_RESET failed: Inappropriate ioctl for device

Any suggestions for how to fix this?

Is the check of scsi_host_in_recovery, which includes
tmf_in_progress, too strong? Most LLDs are not parallel
SCSI where you can just have one TMF on the bus at a
time anymore.

Is returning -ENODEV if the host is in recovery the
wrong code? There might be a device there...it's
just that access is temporarily blocked.

Should sd_ioctl not proceed to scsi_ioctl in so many
cases? Perhaps it should:
* proceed if it gets back -EINVAL (and maybe -ENOTTY)
indicating the ioctl specified was bad
* return if it gets back any other error (e.g., -ENODEV,
-EACCESS, or -EIO)
That would at least result in a more meaningful error.

Also, should scsi_nonblockable_ioctl return -ENOTTY rather
than -ENODEV if cmd is unsupported? There's not really
a no-device problem.

---
Rob Elliott HP Server Storage

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2014-10-11 11:05:05 UTC

Permalink

Post by Elliott, Robert (Server Storage)
Problem
=======
If you run sg_reset --device concurrently to multiple
devices on the same host, then some of them will run
into tmf_in_progress and have scsi_nonblockable_ioctl
return -ENODEV. This causes sd_ioctl to send the ioctl
request to the LLD's ioctl function, where it gets
rejected as unsupported with -ENOTTY. sg_reset ends
sg_reset: SG_SCSI_RESET failed: Inappropriate ioctl for device
Any suggestions for how to fix this?

The who cascading through the different ioctl handlers is a mess. I
think we'll need to rework it entirely, and i'll try to come up with
something for it ASAP.

Post by Elliott, Robert (Server Storage)
Is the check of scsi_host_in_recovery, which includes
tmf_in_progress, too strong? Most LLDs are not parallel
SCSI where you can just have one TMF on the bus at a
time anymore.

I suspect we could relax this, but that's a different issue
from the incorrect error handling in the ioctl handler cascade.

Post by Elliott, Robert (Server Storage)
Is returning -ENODEV if the host is in recovery the
wrong code? There might be a device there...it's
just that access is temporarily blocked.

It's an odd error return, but it's long established. I don't think
we can still change it.

Post by Elliott, Robert (Server Storage)
Also, should scsi_nonblockable_ioctl return -ENOTTY rather
than -ENODEV if cmd is unsupported? There's not really
a no-device problem.

It should - or even better we should kill off scsi_nonblockable_ioctl
entirely. Stay tuned.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2014-10-11 16:11:10 UTC

Permalink

Hi Robert,

can you take a look at the patches at:

http://git.infradead.org/users/hch/scsi.git/shortlog/refs/heads/scsi-ioctl

and confirm if they fix your issue?

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Elliott, Robert (Server Storage)

2014-10-11 22:06:47 UTC

Permalink

-----Original Message-----
Sent: Saturday, 11 October, 2014 11:11 AM
To: Elliott, Robert (Server Storage)
Webb
Subject: Re: Concurrent SG_SCSI_RESET ioctls
Hi Robert,
http://git.infradead.org/users/hch/scsi.git/shortlog/refs/heads/scsi-ioctl
and confirm if they fix your issue?

Thanks. That's a bit better, but sg_reset can now run into
"No such device" errors.

sg_reset: starting device reset
sg_reset: starting device reset
sg_reset: SG_SCSI_RESET failed: No such device
sg_reset: completed device (or target or bus or host) reset
sg_reset: starting device reset
sg_reset: completed device (or target or bus or host) reset
sg_reset: starting device reset
sg_reset: completed device (or target or bus or host) reset

---
Rob Elliott HP Server Storage

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2014-10-14 11:13:54 UTC

Permalink

Post by Elliott, Robert (Server Storage)
Thanks. That's a bit better, but sg_reset can now run into
"No such device" errors.

That's the -ENODEV we return if another reset is in progress. Given
that I suspect sg_reset is the prime if not only user of this interface
I'm happy to change the error code we return here to anything Doug
agrees to.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Douglas Gilbert

2014-10-14 16:42:26 UTC

Permalink

Post by Christoph Hellwig

Post by Elliott, Robert (Server Storage)
Thanks. That's a bit better, but sg_reset can now run into
"No such device" errors.

I don't mind if you change it. However I plan to release
sg3_utils-1.40 in the next 2 or 3 weeks, so that would
be the earliest a revised sg_reset would be available for
distros. Improving error reports is something I always
like to do (so ENODEV for the "in progress" case seems a
bit strident).

On the other hand, the existing sg_reset code with --no-esc
has been in place since sg3_utils-1.36 [20130531]. So there
is a good chance current distributions have that version or
later.

Doug Gilbert

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2014-10-15 13:25:37 UTC

Permalink

Post by Douglas Gilbert
I don't mind if you change it. However I plan to release
sg3_utils-1.40 in the next 2 or 3 weeks, so that would
be the earliest a revised sg_reset would be available for
distros. Improving error reports is something I always
like to do (so ENODEV for the "in progress" case seems a
bit strident).

If sg_utils needs any changes for a different ENODEV we shouldn't
bother - breaking backwards compatibility is a bad idea.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Elliott, Robert (Server Storage)

2014-10-15 14:34:50 UTC

Permalink

-----Original Message-----
Sent: Wednesday, 15 October, 2014 8:26 AM

...

If sg_utils needs any changes for a different ENODEV we shouldn't
bother - breaking backwards compatibility is a bad idea.

None of the values will cause the current version of sg_reset to
retry the request. Maybe -EBUSY should be returned, and interpreted
that way by future versions of sg_reset?

-----Original Message-----
Sent: Monday, 13 October, 2014 5:24 AM

...

Eliminating serialization would be better, though. Devices should
be independent, so bus device resets (that are not escalated)
should also be independent.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2014-10-24 14:51:31 UTC

Permalink

Doug, are you fine with changing the error code to EAGAIN
similar to what we get elsewhere when a O_NONBLOCK open
would have to block? If that's something you can prepare
sg3_utils for I'd like to send a patch for it for the 3.19 merge
window.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Elliott, Robert (Server Storage)

2014-10-11 23:04:17 UTC

Permalink

Post by Elliott, Robert (Server Storage)

Sent: Saturday, 11 October, 2014 11:11 AM

...

Post by Elliott, Robert (Server Storage)

Hi Robert,
http://git.infradead.org/users/hch/scsi.git/shortlog/refs/heads/scsi-ioctl
and confirm if they fix your issue?

Running fio concurrently with all the sg_resets, it also led to
a hang with:
* all the fio threads sitting in blk_mq_get_tag, reported
as blocked tasks after 120 seconds
* all the sg_resets from pass 23 hung

Starting with sg_reset pass 22, where 6 of 14 got
"No such device", here is the serial log (with a few
extra prints added in the scsi midlayer):

[ 2750.165637] ------------------ loop-sg_reset pass 22:
[ 2750.175055] sd_ioctl continuing cmd=0x2284
[ 2750.175469] sd_ioctl continuing cmd=0x2284
[ 2750.175470] sd_ioctl default cmd=0x2284
[ 2750.175474] calling scsi_try_bus_device_reset
[ 2750.175476] hpsa_eh_device_reset_handler called with ffff8804097297c0
[ 2750.175482] hpsa 0000:04:00.0: resetting scsi 2:0:0:0: Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap+ En- Exp=3
[ 2750.181328] hpsa 0000:04:00.0: resetting scsi 2:0:0:0 completed successfully
[ 2750.181330] scsi host2: waking up host to restart after TMF
[ 2750.181377] sd_ioctl continuing cmd=0x2284
[ 2750.181377] sd_ioctl default cmd=0x2284
[ 2750.181379] sd_ioctl continuing cmd=0x2284
[ 2750.181381] sd_ioctl continuing cmd=0x2284
[ 2750.181382] sd_ioctl default cmd=0x2284
[ 2750.181382] calling scsi_try_bus_device_reset
[ 2750.181383] sd_ioctl default cmd=0x2284
[ 2750.181384] hpsa_eh_device_reset_handler called with ffff8803f55de500
[ 2750.181388] hpsa 0000:04:00.0: resetting scsi 2:0:0:12: Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap+ En- Exp=3
[ 2750.181427] hpsa 0000:04:00.0: resetting scsi 2:0:0:12 completed successfully
[ 2750.181428] scsi host2: waking up host to restart after TMF
[ 2750.181443] calling scsi_try_bus_device_reset
[ 2750.181444] calling scsi_try_bus_device_reset
[ 2750.181446] hpsa_eh_device_reset_handler called with ffff8803f966ae00
[ 2750.181447] hpsa_eh_device_reset_handler called with ffff88042d558cc0
[ 2750.181451] hpsa 0000:04:00.0: resetting scsi 2:0:0:8: Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap+ En+ Exp=3
[ 2750.181454] hpsa 0000:04:00.0: resetting scsi 2:0:0:9: Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap+ En- Exp=3
[ 2750.181562] hpsa 0000:04:00.0: resetting scsi 2:0:0:8 completed successfully
[ 2750.181564] scsi host2: waking up host to restart after TMF
[ 2750.181570] hpsa 0000:04:00.0: resetting scsi 2:0:0:9 completed successfully
[ 2750.181572] scsi host2: waking up host to restart after TMF
[ 2750.181] sd_ioctl continuing cmd=0x2285
[ 2750.181840] sd_ioctl default cmd=0x2285
[ 2750.181842] sd_ioctl continuing cmd=0x2285
[ 2750.181842] sd_ioctl default cmd=0x2285
[ 2750.181982] sd_ioctl continuing cmd=0x2285
[ 2750.181982] sd_ioctl default cmd=0x2285
[ 2750.181987] sd_ioctl continuing cmd=0x2285
[ 2750.181988] sd_ioctl default cmd=0x2285
[ 2750.181990] sd_ioctl continuing cmd=0x2285
[ 2750.181990] sd_ioctl default cmd=0x2285
[ 2750.182075] sd_ioctl continuing cmd=0x2285
[ 2750.182076] sd_ioctl default cmd=0x2285
[ 2750.182148] sd_ioctl continuing cmd=0x2285
[ 2750.182148] sd_ioctl default cmd=0x2285
[ 2750.182210] sd_ioctl continuing cmd=0x2285
[ 2750.182211] sd_ioctl default cmd=0x2285
[ 2750.182362] sd_ioctl continuing cmd=0x2285
[ 2750.182362] sd_ioctl default cmd=0x2285
[ 2750.182498] sd_ioctl continuing cmd=0x2285
[ 2750.182499] sd_ioctl default cmd=0x2285
[ 2750.185274] sd_ioctl continuing cmd=0x5331
[ 2750.185275] sd_ioctl default cmd=0x5331
[ 2750.185277] hpsa_ioctl rejecting cmd=0x5331 with -ENOTTY
[ 2750.185612] sd_ioctl continuing cmd=0x2285
[ 2750.185613] sd_ioctl default cmd=0x2285
[ 2750.185614] sd_ioctl continuing cmd=0x2285
[ 2750.185615] sd_ioctl default cmd=0x2285
[ 2750.185888] sd_ioctl continuing cmd=0x2284
[ 2750.185889] sd_ioctl default cmd=0x2284
[ 2750.185892] calling scsi_try_bus_device_reset
[ 2750.185894] hpsa_eh_device_reset_handler called with ffff8804097297c0
[ 2750.185900] hpsa 0000:04:00.0: resetting scsi 2:0:0:5: Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap+ En+ Exp=3
[ 2750.191332] hpsa 0000:04:00.0: resetting scsi 2:0:0:5 completed successfully
[ 2750.191334] scsi host2: waking up host to restart after TMF
[ 2750.191357] sd_ioctl continuing cmd=0x2285
[ 2750.191358] sd_ioctl continuing cmd=0x2285
[ 2750.191360] sd_ioctl continuing cmd=0x2285
[ 2750.191361] sd_ioctl default cmd=0x2285
[ 2750.191361] sd_ioctl default cmd=0x2285
[ 2750.191362] sd_ioctl default cmd=0x2285
2750.191363] sd_ioctl continuing cmd=0x22855
[ 2750.191364] sd_ioctl default cmd=0x2285
[ 2750.191364] sd_ioctl continuing cmd=0x2285
[ 2750.191365] sd_ioctl default cmd=0x2285
[ 2750.191366] sd_ioctl default cmd=0x2285
[ 2750.191368] sd_ioctl continuing cmd=0x2285
[ 2750.191368] sd_ioctl default cmd=0x2285
[ 2750.191369] sd_ioctl continuing cmd=0x2285
[ 2750.191370] sd_ioctl default cmd=0x2285
[ 2750.191375] sd_ioctl continuing cmd=0x5331
[ 2750.191376] sd_ioctl default cmd=0x5331
[ 2750.191377] hpsa_ioctl rejecting cmd=0x5331 with -ENOTTY
[ 2750.191390] sd_ioctl continuing cmd=0x2284
[ 2750.191391] sd_ioctl default cmd=0x2284
[ 2750.191394] calling scsi_try_bus_device_reset
[ 2750.191395] hpsa_eh_device_reset_handler called with ffff88042d558cc0
[ 2750.191399] hpsa 0000:04:00.0: resetting scsi 2:0:0:6: Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap+ En+ Exp=3
[ 2750.191454] hpsa 0000:04:00.0: resetting scsi 2:0:0:6 completed successfully
[ 2750.191455] scsi host2: waking up host to restart after TMF
[ 2750.192487] sd_ioctl continuing cmd=0x2285
[ 2750.192488] sd_ioctl default cmd=0x2285
[ 2750.192648] sd_ioctl continuing cmd=0x2285
[ 2750.192648] sd_ioctl default cmd=0x2285
[ 2750.193005] sd_ioctl continuing cmd=0x2285
[ 2750.193006] sd_ioctl default cmd=0x2285
[ 2750.193055] sd_ioctl continuing cmd=0x2285
[ 2750.193056] sd_ioctl default cmd=0x2285
[ 2750.193963] sd_ioctl continuing cmd=0x2285
[ 2750.193964] sd_ioctl default cmd=0x2285
[ 2750.194211] sd_ioctl continuing cmd=0x2285
[ 2750.194212] sd_ioctl default cmd=0x2285
[ 2750.194514] sd_ioctl continuing cmd=0x2285
[ 2750.194515] sd_ioctl default cmd=0x2285
[ 2750.194522] sd_ioctl continuing cmd=0x2285
[ 2750.194523] sd_ioctl default cmd=0x2285
[ 2750.195154] sd_ioctl continuing cmd=0x2285
[ 2750.195155] sd_ioctl default cmd=0x2285
[ 2750.195156] sd_ioctl continuing cmd=0x2285
[ 2750.195157] sd_ioctl default cmd=0x2285
[ 2750.195448] sd_ioctl continuing cmd=0x2285
[ 2750.195448] sd_ioctl default cmx2285
[ 2750.196647] sd_ioctl continuing cmd=0x2285
[ 2750.196648] sd_ioctl default cmd=0x2285
[ 2750.197132] sd_ioctl continuing cmd=0x2285
[ 2750.197133] sd_ioctl default cmd=0x2285
[ 2750.197134] sd_ioctl continuing cmd=0x2285
[ 2750.197135] sd_ioctl default cmd=0x2285
[ 2750.197157] sd_ioctl continuing cmd=0x2285
[ 2750.197157] sd_ioctl default cmd=0x2285
[ 2750.197708] sd_ioctl continuing cmd=0x2285
[ 2750.197708] sd_ioctl default cmd=0x2285
[ 2750.198254] sd_ioctl continuing cmd=0x2285
[ 2750.198260] sd_ioctl default cmd=0x2285
[ 2750.198281] sd_ioctl continuing cmd=0x2285
[ 2750.198282] sd_ioctl default cmd=0x2285
[ 2750.198338] sd_ioctl continuing cmd=0x2285
[ 2750.198338] sd_ioctl default cmd=0x2285
[ 2750.199244] sd_ioctl continuing cmd=0x2285
[ 2750.199245] sd_ioctl default cmd=0x2285
[ 2750.201160] sd_ioctl continuing cmd=0x2285
[ 2750.201161] sd_ioctl default cmd=0x2285
[ 2750.202978] sd_ioctl continuing cmd=0x5331
[ 2750.202979] sd_ioctl default cmd=0x5331
[ 2750.202980] hpsa_ioctl rejecting cmd=0x5331 with -ENOTTY
[ 2750.203182] sd_ioctl continuing cmd=0x5331
[ 2750.203182] sd_ioctl default cmd=0x5331
[ 2750.203184] hpsa_ioctl rejecting cmd=0x5331 with -ENOTTY
[ 2750.204202] sd_ioctl continuing cmd=0x2285
[ 2750.204202] sd_ioctl default cmd=0x2285
[ 2750.205480] sd_ioctl continuing cmd=0x2285
[ 2750.205480] sd_ioctl default cmd=0x2285
[ 2750.209410] sd_ioctl continuing cmd=0x5331
[ 2750.209411] sd_ioctl default cmd=0x5331
[ 2750.209413] hpsa_ioctl rejecting cmd=0x5331 with -ENOTTY
[ 2750.216017] sd_ioctl continuing cmd=0x5331
[ 2750.216018] sd_ioctl default cmd=0x5331
[ 2750.216019] hpsa_ioctl rejecting cmd=0x5331 with -ENOTTY
[ 2750.217894] sd_ioctl continuing cmd=0x2285
[ 2750.217895] sd_ioctl default cmd=0x2285
[ 2750.219160] sd_ioctl continuing cmd=0x2285
[ 2750.219161] sd_ioctl default cmd=0x2285
[ 2750.221139] sd_ioctl continuing cmd=0x5331
[ 2750.221140] sd_ioctl default cmd=0x5331
[ 2750.221185] hpsa_ioctl rejecting cmd=0x5331 with -ENOTTY
[ 27503518] sd_ioctl continuing cmd=0x5331
[ 2750.223519] sd_ioctl default cmd=0x5331
[ 2750.223521] hpsa_ioctl rejecting cmd=0x5331 with -ENOTTY
[ 2750.229958] sd_ioctl continuing cmd=0x5331
[ 2750.229959] sd_ioctl default cmd=0x5331
[ 2750.229961] hpsa_ioctl rejecting cmd=0x5331 with -ENOTTY
[ 2750.833956] sd_ioctl default cmd=0x2284
[ 2750.835317] calling scsi_try_bus_device_reset
[ 2750.837390] hpsa_eh_device_reset_handler called with ffff880409505380
[ 2750.840046] hpsa 0000:04:00.0: resetting scsi 2:0:0:4: Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap+ En- Exp=3
[ 2750.844430] hpsa 0000:04:00.0: resetting scsi 2:0:0:4 completed successfully
[ 2750.846862] scsi host2: waking up host to restart after TMF
[ 2750.851673] sd_ioctl continuing cmd=0x2285
[ 2750.853238] sd_ioctl default cmd=0x2285
[ 2750.854572] sd_ioctl continuing cmd=0x2285
[ 2750.856024] sd_ioctl default cmd=0x2285
[ 2750.857500] sd_ioctl continuing cmd=0x2285
[ 2750.859011] sd_ioctl default cmd=0x2285
[ 2750.860439] sd_ioctl continuing cmd=0x2285
[ 2750.862025] sd_ioctl default cmd=0x2285
[ 2750.863615] sd_ioctl continuing cmd=0x2285
[ 2750.865214] sd_ioctl default cmd=0x2285
[ 2750.869520] sd_ioctl continuing cmd=0x5331
[ 2750.871225] sd_ioctl default cmd=0x5331
[ 2750.872554] hpsa_ioctl rejecting cmd=0x5331 with -ENOTTY

[ 2751.852314] ------------------ loop-sg_reset pass 23:
[ 2751.856118] sd_ioctl continuing cmd=0x2284
[ 2751.857561] sd_ioctl continuing cmd=0x2284
[ 2751.857562] sd_ioctl default cmd=0x2284
[ 2751.857568] calling scsi_try_bus_device_reset
[ 2751.857571] hpsa_eh_device_reset_handler called with ffff8803f5454840
[ 2751.857577] hpsa 0000:04:00.0: resetting scsi 2:0:0:1: Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap+ En- Exp=3
[ 2751.868394] sd_ioctl default cmd=0x2284

[ 2879.197989] INFO: task fio:13086 blocked for more than 120 seconds.
[ 2879.200164] Tainted: G E 3.17.0+ #93
[ 2879.201948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2879.204610] fio D 0000000000000000 0 13086 13081 0x00000000
[ 2879.207085] ffff88040383b7d8 0000000000000086 ffff88040383b7a8 ffff880403838010
[ 2879.209890] 0000000000012c40 0000000000012c40 ffff8800b1533050 ffff8804038fc190
[ 2879.212612] ffff88040383b7a8 ffff88043f412c40 ffff8804038fc190 ffff88040383b908
[ 2879.215323] Call Trace:
[ 2879.216116] [<ffffffff815a8fe9>] schedule+0x29/0x70
[ 2879.217840] [<ffffffff815a90bc>] io_schedule+0x8c/0xd0
[ 2879.219677] [<ffffffff81276fd7>] bt_get+0x107/0x190
[ 2879.221434] [<ffffffff8108fc40>] ? bit_waitqueue+0xa0/0xa0
[ 2879.223318] [<ffffffff812770cc>] blk_mq_get_tag+0x6c/0xc0
[ 2879.225100] [<ffffffff812735de>] __blk_mq_alloc_request+0x1e/0x230
[ 2879.227280] [<ffffffff81274c6a>] blk_mq_map_request+0x17a/0x1d0
[ 2879.229437] [<ffffffff81275017>] blk_sq_make_request+0x97/0x380
[ 2879.231473] [<ffffffff81138545>] ? mempool_alloc_slab+0x15/0x20
[ 2879.233516] [<ffffffff8126a7c0>] generic_make_request+0xc0/0x100
[ 2879.235580] [<ffffffff8126a871>] submit_bio+0x71/0x140
[ 2879.237413] [<ffffffff811415d8>] ? set_page_dirty_lock+0x38/0x60
[ 2879.239559] [<ffffffff8126269c>] ? bio_set_pages_dirty+0x4c/0x70
[ 2879.241635] [<ffffffff811d40c8>] do_blockdev_direct_IO+0xa88/0xbb0
[ 2879.243761] [<ffffffff8113774a>] ? filemap_map_pages+0x1ca/0x210
[ 2879.245853] [<ffffffff811cf040>] ? I_BDEV+0x10/0x10
[ 2879.247571] [<ffffffff811d3641>] ? do_blockdev_direct_IO+0x1/0xbb0
[ 2879.249770] [<ffffffff811d423c>] __blockdev_direct_IO+0x4c/0x50
[ 2879.251815] [<ffffffff811cf040>] ? I_BDEV+0x10/0x10
[ 2879.253581] [<ffffffff811cffde>] blkdev_direct_IO+0x4e/0x50
[ 2879.255512] [<ffffffff811cf040>] ? I_BDEV+0x10/0x10
[ 2879.257171] [<ffffffff81137d93>] generic_file_read_iter+0x143/0x150
[ 2879.259425] [<fffff811cf270>] ? ioctl_by_bdev+0x50/0x50
[ 2879.361362] [<ffffffff811cf2a7>] blkdev_read_iter+0x37/0x40
[ 2879.363266] [<ffffffff811df890>] aio_run_iocb+0x230/0x430
[ 2879.365163] [<ffffffff815a8ace>] ? __schedule+0x34e/0x720
[ 2879.367489] [<ffffffff81188c90>] ? kmem_cache_alloc+0x70/0x160
[ 2879.369590] [<ffffffff811dfc69>] io_submit_one+0x1d9/0x440
[ 2879.371524] [<ffffffff811e000c>] do_io_submit+0x13c/0x200
[ 2879.373441] [<ffffffff81089223>] ? pick_next_task_fair+0x163/0x220
[ 2879.375574] [<ffffffff811e00e0>] SyS_io_submit+0x10/0x20
[ 2879.377431] [<ffffffff815acc52>] system_call_fastpath+0x16/0x1b
[ 2879.379568] INFO: task fio:13087 blocked for more than 120 seconds.
[ 2879.381729] Tainted: G E 3.17.0+ #93
[ 2879.383444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2879.385954] fio D 0000000000000001 0 13087 13081 0x00000000
[ 2879.388409] ffff88041c6f77d8 0000000000000086 ffff88041c6f77a8 ffff88041c6f4010
[ 2879.390840] 0000000000012c40 0000000000012c40 ffff88042f0010d0 ffff88041c6570d0
[ 2879.393269] ffff88041c6f77a8 ffff88043f432c40 ffff88041c6570d0 ffff88041c6f7908
[ 2879.395706] Call Trace:
[ 2879.396527] [<ffffffff815a8fe9>] schedule+0x29/0x70
[ 2879.398264] [<ffffffff815a90bc>] io_schedule+0x8c/0xd0
[ 2879.400047] [<ffffffff81276fd7>] bt_get+0x107/0x190
[ 2879.401780] [<ffffffff8108fc40>] ? bit_waitqueue+0xa0/0xa0
[ 2879.403897] [<ffffffff812770cc>] blk_mq_get_tag+0x6c/0xc0
[ 2879.405835] [<ffffffff812735de>] __blk_mq_alloc_request+0x1e/0x230
[ 2879.408000] [<ffffffff81274c6a>] blk_mq_map_request+0x17a/0x1d0
[ 2879.410100] [<ffffffff81275017>] blk_sq_make_request+0x97/0x380
[ 2879.412150] [<ffffffff81138545>] ? mempool_alloc_slab+0x15/0x20
[ 2879.414230] [<ffffffff8126a7c0>] generic_make_request+0xc0/0x100
[ 2879.416293] [<ffffffff8126a871>] submit_bio+0x71/0x140
[ 2879.418100] [<ffffffff811415d8>] ? set_page_dirty_lock+0x38/0x60
[ 2879.420237] [<ffffffff8126269c>] ? bio_set_pagesrty+0x4c/0x70
[ 2879.522322] [<ffffffff811d40c8>] do_blockdev_direct_IO+0xa88/0xbb0
[ 2879.524574] [<ffffffff8113774a>] ? filemap_map_pages+0x1ca/0x210
[ 2879.526708] [<ffffffff811cf040>] ? I_BDEV+0x10/0x10
[ 2879.528567] [<ffffffff811d423c>] __blockdev_direct_IO+0x4c/0x50
[ 2879.530624] [<ffffffff811cf040>] ? I_BDEV+0x10/0x10
[ 2879.532297] [<ffffffff811cffde>] blkdev_direct_IO+0x4e/0x50
[ 2879.534267] [<ffffffff811cf040>] ? I_BDEV+0x10/0x10
[ 2879.535967] [<ffffffff81137d93>] generic_file_read_iter+0x143/0x150
[ 2879.538206] [<ffffffff811cf270>] ? ioctl_by_bdev+0x50/0x50
[ 2879.540107] [<ffffffff811cf2a7>] blkdev_read_iter+0x37/0x40
[ 2879.542055] [<ffffffff811df890>] aio_run_iocb+0x230/0x430
[ 2879.543892] [<ffffffff815a8ace>] ? __schedule+0x34e/0x720
[ 2879.545775] [<ffffffff811dfc69>] io_submit_one+0x1d9/0x440
[ 2879.547724] [<ffffffff811e000c>] do_io_submit+0x13c/0x200
[ 2879.549701] [<ffffffff811e00e0>] SyS_io_submit+0x10/0x20
[ 2879.551580] [<ffffffff815acc52>] system_call_fastpath+0x16/0x1b
[ 2879.553618] INFO: task fio:13088 blocked for more than 120 seconds.
<several more of those>

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Bart Van Assche

2014-10-13 09:15:19 UTC

Permalink

Post by Christoph Hellwig
Hi Robert,
http://git.infradead.org/users/hch/scsi.git/shortlog/refs/heads/scsi-ioctl
and confirm if they fix your issue?

Hello Christoph,

At least to me patches 1/4..3/4 look like nice cleanup patches.
Regarding patch 4/4: I'm not sure yet what's the best way for addressing
potentially concurrent SG_SCSI_RESET ioctl calls. As far as I know many
SCSI LLDs have been implemented assuming that eh_*_reset_handler() calls
are serialized per SCSI host. Does this mean that a mutex has to be
added to avoid that an eh_*_reset_handler() call can be triggered via an
ioctl while at the same time the SCSI error handler thread is invoking
one of the eh_*_reset_handler() callback functions due to SCSI error
handling ?

For a prior discussion of this topic, see also
http://thread.gmane.org/gmane.linux.scsi/82582/focus=82747.

Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2014-10-13 10:23:39 UTC

Permalink

At least to me patches 1/4..3/4 look like nice cleanup patches. Regarding
patch 4/4: I'm not sure yet what's the best way for addressing potentially
concurrent SG_SCSI_RESET ioctl calls. As far as I know many SCSI LLDs have
been implemented assuming that eh_*_reset_handler() calls are serialized per
SCSI host. Does this mean that a mutex has to be added to avoid that an
eh_*_reset_handler() call can be triggered via an ioctl while at the same
time the SCSI error handler thread is invoking one of the
eh_*_reset_handler() callback functions due to SCSI error handling ?

Both the existing code and my new code still serialize
eh_*_reset_handler callers using the crude tmf_in_progress flag. Using
a proper lock for it would seem preferable to me, as would be bouncing
the work for SG_SCSI_RESET to the EH thread.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html