Discussion:
O_DIRECT and barriers
Christoph Hellwig
2009-08-20 22:12:21 UTC
Permalink
Btw, something semi-related I've been looking at recently:

Currently O_DIRECT writes bypass all kernel caches, but they do
use the disk caches. We currently don't have any barrier support for
them at all, which is really bad for data integrity in virtualized
environments. I've started thinking about how to implement this.

The simplest scheme would be to mark the last request of each
O_DIRECT write as a barrier request. This works nicely from the FS
perspective and works with all hardware supporting barriers. It's
massive overkill though - we really only need to flush the cache
after our request, and not before. And for SCSI we would be much
better off just setting the FUA bit on the commands and not requiring a
full cache flush at all.

The next scheme would be to simply always do a cache flush after
the direct I/O write has completed, but given that blkdev_issue_flush
blocks until the command is done, that would a) require everyone to
use the end_io callback and b) spend a lot of time in that workqueue.
This only requires one full cache flush, but it's still suboptimal.

I have prototyped this for XFS, but I don't really like it.

The best scheme would be to get some high-level FUA request in the
block layer which gets emulated by a post-command cache flush.
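
For illustration, a rough sketch of what that simplest scheme could look
like at submission time, assuming the 2009-era submit_bio(rw, bio)
interface and the BIO_RW_BARRIER bio flag; the is_last_bio argument and
the helper itself are made up for this example, and the real
fs/direct-io.c path is more involved:

#include <linux/bio.h>

/*
 * Sketch only: tag the final bio of an O_DIRECT write as a barrier so
 * the drive's write cache gets flushed around it.
 */
static void submit_dio_bio(struct bio *bio, int rw, bool is_last_bio)
{
	if (rw == WRITE && is_last_bio)
		rw |= (1 << BIO_RW_BARRIER);

	submit_bio(rw, bio);
}
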
Jens Axboe
2009-08-21 11:40:10 UTC
Permalink
Post by Christoph Hellwig
Currently O_DIRECT writes bypass all kernel caches, but there they do
use the disk caches. We currenly don't have any barrier support for
them at all, which is really bad for data integrity in virtualized
environments. I've started thinking about how to implement this.
The simplest scheme would be to mark the last request of each
O_DIRECT write as barrier requests. This works nicely from the FS
perspective and works with all hardware supporting barriers. It's
massive overkill though - we really only need to flush the cache
after our request, and not before. And for SCSI we would be much
better just setting the FUA bit on the commands and not require a
full cache flush at all.
The next scheme would be to simply always do a cache flush after
the direct I/O write has completed, but given that blkdev_issue_flush
blocks until the command is done that would a) require everyone to
use the end_io callback and b) spend a lot of time in that workque.
This only requires one full cache flush, but it's still suboptimal.
I have prototypes this for XFS, but I don't really like it.
The best scheme would be to get some highlevel FUA request in the
block layer which gets emulated by a post-command cache flush.
I've talked to Chris about this in the past too, but I never got around
to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
without making too many changes, and we do have FUA support on most SATA
drives too. Basically just a check in the driver for whether the
request is O_DIRECT and a WRITE, ala:

if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
WRITE_FUA;

I know that FUA is used by that other OS, so I think we should be golden
on the hw support side.
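
A slightly expanded sketch of that check as it might sit in a driver's
or block layer's prep path; rq_data_dir() and rq_is_sync() are the real
helpers used above, while maybe_mark_fua() and the supports_fua
parameter are made up for this example:

#include <linux/blkdev.h>

/*
 * Hypothetical: promote synchronous writes (O_DIRECT, O_SYNC) to FUA
 * writes when the device supports it, instead of flushing the whole
 * cache afterwards.
 */
static void maybe_mark_fua(struct request *rq, bool supports_fua)
{
	if (rq_data_dir(rq) == WRITE && rq_is_sync(rq) && supports_fua)
		rq->cmd_flags |= REQ_FUA;
}
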
--
Jens Axboe

Jamie Lokier
2009-08-21 13:54:03 UTC
Permalink
Post by Jens Axboe
Post by Christoph Hellwig
Currently O_DIRECT writes bypass all kernel caches, but there they do
use the disk caches. We currenly don't have any barrier support for
them at all, which is really bad for data integrity in virtualized
environments. I've started thinking about how to implement this.
The simplest scheme would be to mark the last request of each
O_DIRECT write as barrier requests. This works nicely from the FS
perspective and works with all hardware supporting barriers. It's
massive overkill though - we really only need to flush the cache
after our request, and not before. And for SCSI we would be much
better just setting the FUA bit on the commands and not require a
full cache flush at all.
The next scheme would be to simply always do a cache flush after
the direct I/O write has completed, but given that blkdev_issue_flush
blocks until the command is done that would a) require everyone to
use the end_io callback and b) spend a lot of time in that workque.
This only requires one full cache flush, but it's still suboptimal.
I have prototypes this for XFS, but I don't really like it.
The best scheme would be to get some highlevel FUA request in the
block layer which gets emulated by a post-command cache flush.
I've talked to Chris about this in the past too, but I never got around
to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
without making too many changes, and we do have FUA support on most SATA
drives too. Basically just a check in the driver for whether the
if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
WRITE_FUA;
I know that FUA is used by that other OS, so I think we should be golden
on the hw support side.
I've been thinking about this too, and for optimal performance with
VMs and also with databases, I think FUA is too strong. (It's also
too weak, on drives which don't have FUA).

I would like to be able to get the same performance and integrity as
the kernel filesystems can get, and that means using barrier flushes
when a kernel filesystem would use them, and FUA when a kernel
filesystem would use that. Preferably the same whether userspace is
using a file or a block device.

The conclusion I came to is that O_DIRECT users need a barrier flush
primitive. FUA can either be deduced by the elevator, or signalled
explicitly by userspace.

Fortunately there's already a sensible API for both: fdatasync (and
aio_fsync) to mean flush, and O_DSYNC (or inferred from
flush-after-one-write) to mean FUA.

Those apply to files, but they could be made to have the same effect
with block devices, which would be nice for applications which can use
both. I'll talk about files from here on; assume the idea is to
provide the same functions for block devices.

It turns out that applications needing integrity must use fdatasync or
O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
choose to use buffered writes at any time, with no signal to the
application. O_DSYNC or fdatasync ensures that unknown buffered
writes will be committed. This is true for other operating systems
too, for the same reason, except some other unixes will convert all
writes to buffered writes, not just corner cases, under various
circumstances that are hard for applications to detect.

So there's already a good match to using fdatasync and/or O_DSYNC for
O_DIRECT integrity.

If we define fdatasync's behaviour to be that it always causes a
barrier flush if there have been any WRITE commands to a disk since
the last barrier flush, in addition to its behaviour of flushing
cached pages, that would be enough for VM and database applications
to have good support for integrity. Of course O_DSYNC would imply
the same after each write.
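
As a minimal userspace sketch of that model - a couple of O_DIRECT
writes with fdatasync() as the flush/barrier point; the file name and
the 4096-byte alignment are assumptions for the example, and error
handling is omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

	/* O_DIRECT buffers must obey the device/filesystem alignment
	 * rules; 4096 bytes is assumed to be sufficient here. */
	posix_memalign(&buf, 4096, 4096);
	memset(buf, 0x42, 4096);

	pwrite(fd, buf, 4096, 0);    /* may land in the drive's write cache */
	pwrite(fd, buf, 4096, 4096);

	/* Proposed semantics: fdatasync() is the barrier/flush point that
	 * forces the drive cache to stable storage. */
	fdatasync(fd);

	free(buf);
	close(fd);
	return 0;
}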

As an optimisation, I think that FUA might be best done by the
elevator detecting opportunities to do that, rather than explicitly
signalled.

For VMs, the highest performance (with integrity) will likely come from:

If the guest requests a virtual disk with write cache enabled:

- Host opens file/blockdev with O_DIRECT (but *not O_DSYNC*)
- Host maps guest's WRITE commands to host writes
- Host maps guest's CACHE FLUSH commands to fdatasync on host

If the guest requests a virtual disk with write cache disabled:

- Host opens file/blockdev with O_DIRECT|O_DSYNC
- Host maps guest's WRITE commands to host writes
- Host maps guest's CACHE FLUSH commands to nothing

That's with host configured to use O_DIRECT. If the host is
configured to not use O_DIRECT, the same logic applies except that
O_DIRECT is simply omitted. Nice and simple eh?
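
A sketch of that mapping in a hypothetical host block backend (the
struct and function names are illustrative, not any particular
hypervisor's API):

#include <unistd.h>

struct vdisk {
	int fd;          /* opened with O_DIRECT, plus O_DSYNC if the guest
	                    asked for its write cache to be disabled */
	int wc_enabled;  /* does the guest see a volatile write cache? */
};

static ssize_t handle_guest_write(struct vdisk *vd, const void *buf,
                                  size_t len, off_t off)
{
	return pwrite(vd->fd, buf, len, off);
}

static int handle_guest_cache_flush(struct vdisk *vd)
{
	/* Write-through virtual disk: every write was already O_DSYNC,
	 * so the guest's CACHE FLUSH maps to nothing. */
	if (!vd->wc_enabled)
		return 0;
	return fdatasync(vd->fd);
}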

Databases and userspace filesystems would be encouraged to do the
equivalent. In other words, databases would open with O_DIRECT or not
(depending on behaviour preferred), and use fdatasync for barriers, or
use O_DSYNC if they are not using fdatasync.

Notice how it conveniently does the right thing when the kernel falls
back to buffered writes without telling anyone.

Code written in that way should do the right thing (or as close as
it's possible to get) on other OSes too.

(Btw, from what I can tell from various Windows documentation, it maps
the equivalent of O_DIRECT|O_DSYNC to setting FUA on every disk write,
and it maps the equivalent of fsync to sending the disk a cache
flush command as well as writing file metadata. There's no Windows
equivalent to O_SYNC or fdatasync.)

-- Jamie
Christoph Hellwig
2009-08-21 14:26:35 UTC
Permalink
Post by Jamie Lokier
I've been thinking about this too, and for optimal performance with
VMs and also with databases, I think FUA is too strong. (It's also
too weak, on drives which don't have FUA).
Why is FUA too strong?
Post by Jamie Lokier
Fortunately there's already a sensible API for both: fdatasync (and
aio_fsync) to mean flush, and O_DSYNC (or inferred from
flush-after-one-write) to mean FUA.
I thought about this a lot. It would be sensible to only require
the FUA semantics if O_SYNC is specified. But from looking around at
users of O_DIRECT no one seems to actually specify O_SYNC with it.
And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
if O_DIRECT bypasses the filesystem cache there is nothing else
left to sync for a non-extending write. That is until those pesky disk
write back caches come into play that no application writer wants or
should have to understand.
Post by Jamie Lokier
It turns out that applications needing integrity must use fdatasync or
O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
choose to use buffered writes at any time, with no signal to the
application.
The fallback was a relatively recent addition to the O_DIRECT semantics
for broken filesystems that can't handle holes very well. Fortunately
enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
semantics for that already.

Jamie Lokier
2009-08-21 15:24:59 UTC
Permalink
Post by Christoph Hellwig
Post by Jamie Lokier
I've been thinking about this too, and for optimal performance with
VMs and also with databases, I think FUA is too strong. (It's also
too weak, on drives which don't have FUA).
Why is FUA too strong?
In measurements I've done, disabling a disk's write cache results in
much slower ext3 filesystem writes than using barriers. Others report
similar results. This is with disks that don't have NCQ; good NCQ may
be better.

Using FUA for all writes should be equivalent to writing with write
cache disabled.

A journalling filesystem or database tends to write like this:

(guest) WRITE
(guest) WRITE
(guest) WRITE
(guest) WRITE
(guest) WRITE
(guest) CACHE FLUSH
(guest) WRITE
(guest) CACHE FLUSH
(guest) WRITE
(guest) WRITE
(guest) WRITE

When a guest does that, for integrity it can be mapped to this on the
host with FUA:

(host) WRITE FUA
(host) WRITE FUA
(host) WRITE FUA
(host) WRITE FUA
(host) WRITE FUA
(host) WRITE FUA
(host) WRITE FUA
(host) WRITE FUA
(host) WRITE FUA

or

(host) WRITE
(host) WRITE
(host) WRITE
(host) WRITE
(host) WRITE
(host) CACHE FLUSH
(host) WRITE
(host) CACHE FLUSH
(host) WRITE
(host) WRITE
(host) WRITE

We know from measurements that disabling the disk write cache is much
slower than using barriers, at least with some disks.

Assuming that WRITE FUA is equivalent to disabling write cache, we may
expect the WRITE FUA version to run much slower than the CACHE FLUSH
version.

It's also too weak, of course, on drives which don't support FUA.
Then you have to use CACHE FLUSH anyway, so the code should support
that (or disable the write cache entirely, which also performs badly).
If you don't handle drives without FUA, then you're back to "integrity
sometimes, user must check type of hardware", which is something we're
trying to get away from. Integrity should not be a surprise when the
application requests it.
Post by Christoph Hellwig
Post by Jamie Lokier
Fortunately there's already a sensible API for both: fdatasync (and
aio_fsync) to mean flush, and O_DSYNC (or inferred from
flush-after-one-write) to mean FUA.
I thought about this alot . It would be sensible to only require
the FUA semantics if O_SYNC is specified. But from looking around at
users of O_DIRECT no one seems to actually specify O_SYNC with it.
O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
inode metadata (like mtime) too. O_DIRECT|O_DSYNC is better.

O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
integrity problems when direct writes are converted to buffered writes
- which applies to all or nearly all OSes according to their
documentation (I've read a lot of them).

I notice that all applications I looked at which use O_DIRECT don't
attempt to determine when O_DIRECT will definitely result in direct
writes; they simply assume it can be used as a substitute for O_SYNC
or O_DSYNC, as long as you follow the alignment rules. Generally they
leave it to the user to configure what they want, and often don't
explain the drive integrity issue, except to say "depends on the OS,
your mileage may vary, we can do nothing about it".

Imho, integrity should not be something which depends on the user
knowing the details of their hardware to decide application
configuration options - at least, not out of the box.

On a related note,
http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm
says:

Direct I/O and Data I/O Integrity Completion

Although direct I/O writes are done synchronously, they do not
provide synchronized I/O data integrity completion, as defined by
POSIX. Applications that need this feature should use O_DSYNC in
addition to O_DIRECT. O_DSYNC guarantees that all of the data and
enough of the metadata (for example, indirect blocks) have written
to the stable store to be able to retrieve the data after a system
crash. O_DIRECT only writes the data; it does not write the
metadata.

That's another reason to use O_DIRECT|O_DSYNC in moderately portable code.
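
A small sketch of that moderately portable pattern; falling back to
plain O_DSYNC when the filesystem rejects O_DIRECT (EINVAL on Linux) is
an assumption about what such an application might choose to do:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Ask for direct I/O, but always keep O_DSYNC so that a silent
 * fallback to buffered writes - or missing data-finding metadata -
 * is still committed by each write(). */
static int open_for_durable_writes(const char *path)
{
	int fd = open(path, O_WRONLY | O_DIRECT | O_DSYNC);

	if (fd < 0 && errno == EINVAL)
		/* Filesystem refuses O_DIRECT: keep the integrity and give
		 * up on bypassing the page cache. */
		fd = open(path, O_WRONLY | O_DSYNC);

	return fd;
}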
Post by Christoph Hellwig
And on Linux where O_SYNC really means O_DYSNC that's pretty sensible -
if O_DIRECT bypasses the filesystem cache there is nothing else
left to sync for a non-extending write.
Oh, O_SYNC means O_DSYNC? I thought it was the other way around.
Ugh, how messy.
Post by Christoph Hellwig
That is until those pesky disk
write back caches come into play that no application writer wants or
should have to understand.
As far as I can tell, they generally go out of their way to avoid
understanding it, except as a vaguely uncomfortable awareness and pass
the problem on to the application's user.

Unfortunately just disabling the disk cache for O_DIRECT would make
its performance drop significantly; otherwise I'd say go for it.
Post by Christoph Hellwig
Post by Jamie Lokier
It turns out that applications needing integrity must use fdatasync or
O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
choose to use buffered writes at any time, with no signal to the
application.
The fallback was a relatively recent addition to the O_DIRECT semantics
for broken filesystems that can't handle holes very well. Fortunately
enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
semantics for that already.
Ok, so you're saying there's no _harm_ in specifying O_DSYNC with
O_DIRECT either? :-)

-- Jamie
Christoph Hellwig
2009-08-21 17:45:25 UTC
Permalink
Post by Jamie Lokier
In measurements I've done, disabling a disk's write cache results in
much slower ext3 filesystem writes than using barriers. Others report
similar results. This is with disks that don't have NCQ; good NCQ may
be better.
On a SCSI disk and a SATA SSD with NCQ I get different results. Most
workloads, in particular metadata-intensive ones and large streaming
writes, are noticeably better just turning off the write cache. The only
ones that benefit from it are relatively small writes without O_SYNC
or many fsyncs. This is however using XFS, which tends to issue many
more barriers than ext3.
Post by Jamie Lokier
Using FUA for all writes should be equivalent to writing with write
cache disabled.
(guest) WRITE
(guest) WRITE
(guest) WRITE
(guest) WRITE
(guest) WRITE
(guest) CACHE FLUSH
(guest) WRITE
(guest) CACHE FLUSH
(guest) WRITE
(guest) WRITE
(guest) WRITE
In the optimal case, yeah.
Post by Jamie Lokier
Assuming that WRITE FUA is equivalent to disabling write cache, we may
expect the WRITE FUA version to run much slower than the CACHE FLUSH
version.
For a workload that only does FUA writes, yeah. That is however the use
case for virtual machines. As I'm looking into those issues I will run
some benchmarks comparing both variants.
Post by Jamie Lokier
It's also too weak, of course, on drives which don't support FUA.
Then you have to use CACHE FLUSH anyway, so the code should support
that (or disable the write cache entirely, which also performs badly).
If you don't handle drives without FUA, then you're back to "integrity
sometimes, user must check type of hardware", which is something we're
trying to get away from. Integrity should not be a surprise when the
application requests it.
As mentioned in the previous mails, FUA would only be an optimization
(if it ends up helping); we do need to support the cache flush case.
Post by Jamie Lokier
Post by Christoph Hellwig
I thought about this alot . It would be sensible to only require
the FUA semantics if O_SYNC is specified. But from looking around at
users of O_DIRECT no one seems to actually specify O_SYNC with it.
O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
inode metadata (like mtime) too. O_DIRECT|O_DSYNC is better.
O_SYNC above is the Linux O_SYNC, aka Posix O_DSYNC.
Post by Jamie Lokier
O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
integrity problems when direct writes are converted to buffered writes
- which applies to all or nearly all OSes according to their
documentation (I've read a lot of them).
It did not happen on IRIX where O_DIRECT originated,
neither does it happen on Linux when using XFS. Then again, at least on
Linux we provide O_SYNC (that is Linux O_SYNC, aka Posix O_DSYNC)
semantics for that case.
Post by Jamie Lokier
Imho, integrity should not be something which depends on the user
knowing the details of their hardware to decide application
configuration options - at least, not out of the box.
That is what I meant. Only doing cache flushes/FUA for O_DIRECT|O_DSYNC
is not what users naively expect. And the wording in our manpages also
suggests this behaviour, although it is not entirely clear:


O_DIRECT (Since Linux 2.4.10)

Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File I/O
is done directly to/from user space buffers. The I/O is synchronous,
that is, at the completion of a read(2) or write(2), data is
guaranteed to have been transferred. See NOTES below for further
discussion.

(And yeah, the whole wording is horrible, I will send an update once
we've sorted out the semantics, including caveats about older kernels)
Post by Jamie Lokier
Post by Christoph Hellwig
And on Linux where O_SYNC really means O_DYSNC that's pretty sensible -
if O_DIRECT bypasses the filesystem cache there is nothing else
left to sync for a non-extending write.
Oh, O_SYNC means O_DSYNC? I thought it was the other way around.
Ugh, how messy.
Yes. Except when using XFS and using the "osyncisosync" mount option :)
Post by Jamie Lokier
Post by Christoph Hellwig
The fallback was a relatively recent addition to the O_DIRECT semantics
for broken filesystems that can't handle holes very well. Fortunately
enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
semantics for that already.
Ok, so you're saying there's no _harm_ in specifying O_DSYNC with
O_DIRECT either? :-)
No. In the generic code and filesystems I looked at it simply has no
effect at all.

Ric Wheeler
2009-08-21 19:18:03 UTC
Permalink
Post by Christoph Hellwig
Post by Jamie Lokier
In measurements I've done, disabling a disk's write cache results in
much slower ext3 filesystem writes than using barriers. Others report
similar results. This is with disks that don't have NCQ; good NCQ may
be better.
On a scsi disk and a SATA SSD with NCQ I get different results. Most
worksloads, in particular metadata-intensive ones and large streaming
writes are noticably better just turning off the write cache. The only
onces that benefit from it are relatively small writes witout O_SYNC
or much fsyncs. This is however using XFS which tends to issue much
more barriers than ext3.
With normal S-ATA disks, streaming write workloads on ext3 run twice as
fast with barriers & write cache enabled in my testing.

Small file workloads were more even if I remember correctly...

ric
Jamie Lokier
2009-08-22 00:50:06 UTC
Permalink
Post by Christoph Hellwig
Post by Jamie Lokier
O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
integrity problems when direct writes are converted to buffered writes
- which applies to all or nearly all OSes according to their
documentation (I've read a lot of them).
It did not happen on IRIX where O_DIRECT originated that did not happen,
IRIX has an unusually sane O_DIRECT - at least according to its
documentation. This is from write(2):

When attempting to write to a file with O_DIRECT or FDIRECT set,
the portion being written can not be locked in memory by any
process. In this case, -1 will be returned and errno will be set
to EBUSY.

AIX however says this:

In order to avoid consistency issues between programs that use
Direct I/O and programs that use normal cached I/O, Direct I/O is
by default used in an exclusive use mode. If there are multiple
opens of a file and some of them are direct and others are not,
the file will stay in its normal cached access mode. Only when
the file is open exclusively by Direct I/O programs will the file
be placed in Direct I/O mode.

Similarly, if the file is mapped into virtual memory via the
shmat() or mmap() system calls, then file will stay in normal
cached mode.

The JFS or JFS2 will attempt to move the file into Direct I/O
mode any time the last conflicting, non-direct access is
eliminated (either by close(), munmap(), or shmdt()
subroutines). Changing the file from normal mode to Direct I/O
mode can be rather expensive since it requires writing all
modified pages to disk and removing all the file's pages from
memory.
Post by Christoph Hellwig
neither does it happen on Linux when using XFS. Then again at least on
Linux we provide O_SYNC (that is Linux O_SYNC, aka Posix O_DYSC)
semantics for that case.
As Ted Ts'o pointed out, we don't.
Post by Christoph Hellwig
Post by Jamie Lokier
Imho, integrity should not be something which depends on the user
knowing the details of their hardware to decide application
configuration options - at least, not out of the box.
That is what I meant. Only doing cache flushes/FUA for O_DIRECT|O_DSYNC
is not what users naively expect.
Oh, I agree with that. That comes from observing that quasi-portable
code using O_DIRECT needs to use O_DSYNC too because several OSes and
filesystems on those OSes revert to buffered writes under some
circumstances, in which case you want O_DSYNC too. That has nothing
to do with hardware caches, but it's a lucky coincidence that
fdatasync() would form a nice barrier function, and O_DIRECT|O_DSYNC
would then make sense as an FUA equivalent.
Post by Christoph Hellwig
And the wording in hour manpages also suggests this behaviour,
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File I/O
is done directly to/from user space buffers. The I/O is synchronous,
that is, at the completion of a read(2) or write(2), data is
guaranteed to have been transferred. See NOTES below forfurther
discussion.
Perhaps in the same way that fsync/fdatasync aren't clear on disk
cache behaviour either. On Linux and some other OSes.
Post by Christoph Hellwig
(And yeah, the whole wording is horrible, I will send an update once
we've sorted out the semantics, including caveats about older kernels)
One thing it's unhelpful about is the performance. O_DIRECT tends to
improve performance for applications that do their own caching; it
also improves performance in whole systems where caching would
cause memory pressure, and on Linux O_DIRECT is necessary for AIO,
which can improve performance.

I have a 166MHz embedded device that I'm using O_DIRECT on to improve
performance - from 1MB/s to 10MB/s.

However if O_DIRECT is changed to force each write(2) through the disk
cache separately, then it will no longer provide this performance
boost at least with some kinds of disk.

That's why it's important not to change it casually. Maybe it's the
right thing to do, but then it will be important to provide another
form of O_DIRECT which does not write through the disk cache, instead
providing a barrier capability.

(...After all, if we believed in integrity above everything then barriers
would be enabled for ext3 by default, *ahem*.)

Probably the best thing to do is look at what other OSes that are used
by databases etc. do with O_DIRECT, and if it makes sense, copy it.

What does IRIX do? Does O_DIRECT on IRIX write through the drive's
cache? What about Solaris?

-- Jamie
Theodore Tso
2009-08-22 02:19:56 UTC
Permalink
Post by Jamie Lokier
Post by Christoph Hellwig
Post by Jamie Lokier
O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
integrity problems when direct writes are converted to buffered writes
- which applies to all or nearly all OSes according to their
documentation (I've read a lot of them).
It did not happen on IRIX where O_DIRECT originated that did not happen,
IRIX has an unusually sane O_DIRECT - at least according to it's
When attempting to write to a file with O_DIRECT or FDIRECT set,
the portion being written can not be locked in memory by any
process. In this case, -1 will be returned and errno will be set
to EBUSY.
Can you forward a pointer to an Irix man page which describes its
O_DIRECT semantics (or at least what they claim in their man pages)?
I was looking for one on the web, but I couldn't seem to find any
on-line web pages for Irix.

It'd also be nice if we could get permission from SGI to quote
relevant sections in the "Clarifying Direct I/O Semantics" wiki page,
in case we end up quoting more than what someone might consider fair
game for fair use, but for now I'd be really happy getting something
that I could look at for reference purposes. Was there anything more
than what you quoted in the Irix write(2) man page about O_DIRECT?

Thanks,

- Ted
Theodore Tso
2009-08-22 02:31:40 UTC
Permalink
Post by Theodore Tso
Can you forward a pointer to an Irix man page which describes its
O_DIRECT semantics (or at least what they claim in their man pages)?
I was looking for one on the web, but I couldn't seem to find any
on-line web pages for Irix.
Never mind, I found it. (And I've added the relevant bits to the wiki
article).

- Ted
Christoph Hellwig
2009-08-24 02:34:22 UTC
Permalink
Post by Jamie Lokier
Oh, I agree with that. That comes from observing that quasi-portable
code using O_DIRECT needs to use O_DSYNC too because several OSes and
filesystems on those OSes revert to buffered writes under some
circumstances, in which case you want O_DSYNC too. That has nothing
to do with hardware caches, but it's a lucky coincidence that
fdatasync() would form a nice barrier function, and O_DIRECT|O_DSYNC
would then make sense as an FUA equivalent.
I agree. I do however worry about everything using O_DIRECT that is
around now. Less so about the databases and HPC workloads on expensive
hardware because they usually run on vendor approved scsi disks that
have the write back cache disabled, but rather things like
virtualization software or other things that get run on commodity
hardware.

Then again they already don't get what they expect and never did,
so if we clearly document and communicate the O_SYNC (that is Linux
O_SYNC) requirement we might be able to go with this.
Post by Jamie Lokier
Perhaps in the same way that fsync/fdatasync aren't clear on disk
cache behaviour either. On Linux and some other OSes.
The disk write cache really is an implementation detail; it has no
business in Posix.

Posix seems to define the semantics for fdatasync and co. relatively
well (that is, if you like the specification speak in there):

"The fdatasync() function forces all currently queued I/O operations
associated with the file indicated by file descriptor fildes to the
synchronised I/O completion state."

"synchronised I/O data integrity completion

o For read, when the operation has been completed or diagnosed if
unsuccessful. The read is complete only when an image of the data has
been successfully transferred to the requesting process. If there were
any pending write requests affecting the data to be read at the time
that the synchronised read operation was requested, these write
requests shall be successfully transferred prior to reading the
data."
o For write, when the operation has been completed or diagnosed if
unsuccessful. The write is complete only when the data specified in the
write request is successfully transferred and all file system
information required to retrieve the data is successfully transferred."

Given that it talks about the data being retrievable, a volatile cache
does not seem to meet the above criteria. But yeah, it's horrible language.
Post by Jamie Lokier
What does IRIX do? Does O_DIRECT on IRIX write through the drive's
cache? What about Solaris?
IRIX only came pre-packaged with SGI MIPS systems, which, like most of
the more expensive hardware, were not configured with volatile
write-back caches. That is btw still the case for all the more
expensive hardware I have. The whole volatile write-back cache issue
is just too much of a data integrity nightmare to enable it where
your customers actually care about their data.

Jamie Lokier
2009-08-27 14:34:59 UTC
Permalink
Post by Christoph Hellwig
Then again they already don't get what they expect and never did,
so if we clear document and communicate the O_SYNC (that is Linux
O_SYNC) requirement we might be able to go with this.
I'm thinking, while we're looking at this, that now is a really good
time to split up O_SYNC and O_DSYNC.

We have separate fsync and fdatasync, so it should be quite tidy now.

Then we can document using O_DSYNC on Linux, which is fine for older
versions because it has the same value as O_SYNC at the moment.
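
For code that wants to start spelling it O_DSYNC today, a fallback
define along these lines should be harmless, since current Linux
headers give the two names the same value anyway:

#include <fcntl.h>

#ifndef O_DSYNC
#define O_DSYNC O_SYNC	/* pre-split headers: same value on Linux */
#endif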

-- Jamie
Theodore Tso
2009-08-21 22:08:52 UTC
Permalink
Post by Christoph Hellwig
Post by Jamie Lokier
It turns out that applications needing integrity must use fdatasync or
O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
choose to use buffered writes at any time, with no signal to the
application.
The fallback was a relatively recent addition to the O_DIRECT semantics
for broken filesystems that can't handle holes very well. Fortunately
enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
semantics for that already.
Um, actually, we don't. If we did that, we would have to wait for a
journal commit to complete before allowing the write(2) to complete,
which would be especially painfully slow for ext3.

This question recently came up on the ext4 developers' list, because
of a question of how direct I/O to a preallocated (uninitialized)
extent should be handled. Are we supposed to guarantee synchronous
updates of the metadata by the time write(2) returns, or not? One of
the ext4 developers (I can't remember if it was Mingming or Eric)
asked an XFS developer what they did in that case, and I believe the
answer they were given was that XFS started a commit, but did *not*
wait for the commit to complete before returning from the Direct I/O
write. In fact, they were told (I believe this was from an SGI
engineer, but I don't remember the name; we can track that down if
it's important) that if an application wanted to guarantee metadata
would be updated for an extending write, they had to use fsync() or
O_SYNC/O_DSYNC.
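
In application terms that advice looks roughly like the sketch below:
an extending O_DIRECT write followed by an explicit fsync() to commit
the size/block-mapping update. File name, sizes and alignment are made
up for the example and error handling is omitted.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int fd = open("grow.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);

	posix_memalign(&buf, 4096, 4096);
	memset(buf, 0, 4096);

	/* Extending direct write: the data block goes straight to the
	 * device, but the new file size / block mapping is only committed
	 * by the journal asynchronously. */
	pwrite(fd, buf, 4096, 0);

	/* Without this, a crash can leave the just-written data
	 * unreachable even though write(2) "completed". */
	fsync(fd);

	free(buf);
	close(fd);
	return 0;
}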

Perhaps they were given an incorrect answer, but it's clear that the
semantics of exactly how Direct I/O works in edge cases aren't well
defined, or at least aren't clearly and widely understood.

I have an early draft (for discussion only) of what we think it means and
what is currently implemented in Linux, which I've put up (again, let
me emphasize) for *discussion* here:

http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

Comments are welcome, either on the wiki's talk page, or directly to
me, or to the linux-fsdevel or linux-ext4.

- Ted
Joel Becker
2009-08-21 22:38:37 UTC
Permalink
Post by Theodore Tso
I have an early draft (for discussion only) what we think it means and
what is currently implemented in Linux, which I've put up, (again, let
http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
I think you mean "not well specified". ;-)

Joel
--
Life's Little Instruction Book #511

"Call your mother."

Joel Becker
Principal Software Developer
Oracle
E-mail: ***@oracle.com
Phone: (650) 506-8127
Joel Becker
2009-08-21 22:45:18 UTC
Permalink
Post by Theodore Tso
http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
Comments are welcome, either on the wiki's talk page, or directly to
me, or to the linux-fsdevel or linux-ext4.
In the section on perhaps not waiting for buffered fallback, we
need to clarify that O_DIRECT reads need to know to look in the
pagecache. That is, if we decide that extending O_DIRECT writes without
fsync can return before the data hits the storage, the caller shouldn't
also have to call fsync() just to call read() of data they just wrote!

Joel
--
To spot the expert, pick the one who predicts the job will take the
longest and cost the most.

Joel Becker
Principal Software Developer
Oracle
E-mail: ***@oracle.com
Phone: (650) 506-8127
Jamie Lokier
2009-08-22 00:56:13 UTC
Permalink
Post by Theodore Tso
Post by Christoph Hellwig
Post by Jamie Lokier
It turns out that applications needing integrity must use fdatasync or
O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
choose to use buffered writes at any time, with no signal to the
application.
The fallback was a relatively recent addition to the O_DIRECT semantics
for broken filesystems that can't handle holes very well. Fortunately
enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
semantics for that already.
Um, actually, we don't. If we did that, we would have to wait for a
journal commit to complete before allowing the write(2) to complete,
which would be especially painfully slow for ext3.
This question recently came up on the ext4 developer's list, because
of a question of how direct I/O to an preallocated (uninitialized)
extent should be handled. Are we supposed to guarantee synchronous
updates of the metadata by the time write(2) returns, or not? One of
the ext4 developers (I can't remember if it was Mingming or Eric)
asked an XFS developer what they did in that case, and I believe the
answer they were given was that XFS started a commit, but did *not*
wait for the commit to complete before returning from the Direct I/O
write. In fact, they were told (I believe this was from an SGI
engineer, but I don't remember the name; we can track that down if
it's important) that if an application wanted to guarantee metadata
would be updated for an extending write, they had to use fsync() or
O_SYNC/O_DSYNC.
Perhaps they were given an incorrect answer, but it's clear the
semantics of exactly how Direct I/O works in edge cases isn't well
defined, or at least clearly and widely understood.
And that's not even a hardware cache issue, just whether filesystem
metadata is written.

AIX behaves like XFS according to documentation:

[ http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm ]

Direct I/O and Data I/O Integrity Completion

Although direct I/O writes are done synchronously, they do not
provide synchronized I/O data integrity completion, as defined by
POSIX. Applications that need this feature should use O_DSYNC in
addition to O_DIRECT. O_DSYNC guarantees that all of the data and
enough of the metadata (for example, indirect blocks) have written
to the stable store to be able to retrieve the data after a system
crash. O_DIRECT only writes the data; it does not write the
metadata.

That's another reason to use O_DIRECT|O_DSYNC in moderately portable
code.
Post by Theodore Tso
I have an early draft (for discussion only) what we think it means and
what is currently implemented in Linux, which I've put up, (again, let
http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
Comments are welcome, either on the wiki's talk page, or directly to
me, or to the linux-fsdevel or linux-ext4.
I haven't read it yet. One thing which comes to mind is it would be
good to summarise what other OSes as well as Linux do with O_DIRECT
w.r.t. data-finding metadata, preallocation, file extending, hole
filling, unaligned access and what alignment is required, block
devices vs. files and different filesystems and behaviour-modifying
mount options, file open for buffered I/O on another descriptor, file
has mapped pages, mlocked pages, and of course drive cache write
through or not.

-- Jamie
Theodore Tso
2009-08-22 02:06:07 UTC
Permalink
Post by Jamie Lokier
[ http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm ]
Direct I/O and Data I/O Integrity Completion
Although direct I/O writes are done synchronously, they do not
provide synchronized I/O data integrity completion, as defined by
POSIX. Applications that need this feature should use O_DSYNC in
addition to O_DIRECT. O_DSYNC guarantees that all of the data and
enough of the metadata (for example, indirect blocks) have written
to the stable store to be able to retrieve the data after a system
crash. O_DIRECT only writes the data; it does not write the
metadata.
That's another reason to use O_DIRECT|O_DSYNC in moderately portable
code.
...or use fsync() when they need to guarantee that data has been
atomically written, but not before. This becomes critically important
if the application is writing into a sparse file, or writing into
uninitialized blocks that were allocated using fallocate(); otherwise,
with O_DIRECT|O_DSYNC, the file system would have to do a commit
operation after each write, which could be a performance disaster.
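
A sketch of that pattern - preallocate once, do many direct writes,
then a single fsync() at the natural commit point - assuming a
filesystem with fallocate(2) support; the file name and sizes are
illustrative and error handling is omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int i;
	int fd = open("preallocated.db", O_WRONLY | O_CREAT | O_DIRECT, 0644);

	/* Preallocate once; the extents start out marked uninitialized. */
	fallocate(fd, 0, 0, 64 * 4096);

	posix_memalign(&buf, 4096, 4096);
	memset(buf, 0, 4096);

	/* Many direct writes into the preallocated range... */
	for (i = 0; i < 64; i++)
		pwrite(fd, buf, 4096, (off_t)i * 4096);

	/* ...then one fsync() commits the unwritten-to-written extent
	 * conversions in a single journal commit, instead of forcing a
	 * commit per write the way O_DSYNC would. */
	fsync(fd);

	free(buf);
	close(fd);
	return 0;
}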
Post by Jamie Lokier
Post by Theodore Tso
http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
Comments are welcome, either on the wiki's talk page, or directly to
me, or to the linux-fsdevel or linux-ext4.
I haven't read it yet. One thing which comes to mind is it would be
good to summarise what other OSes as well as Linux do with O_DIRECT
w.r.t. data-finding metadata, preallocation, file extending, hole
filling, unaligned access and what alignment is required, block
devices vs. files and different filesystems and behaviour-modifying
mount options, file open for buffered I/O on another descriptor, file
has mapped pages, mlocked pages, and of course drive cache write
through or not.
It's a wiki; contributions to define all of that are welcome. :-)

We may want to carefully consider what we want to guarantee for all
time to application writers, and what we might want to leave open to
allow for performance optimizations by the kernel, though.

- Ted
Theodore Tso
2009-08-22 02:11:37 UTC
Permalink
Post by Joel Becker
Post by Theodore Tso
http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
Comments are welcome, either on the wiki's talk page, or directly to
me, or to the linux-fsdevel or linux-ext4.
In the section on perhaps not waiting for buffered fallback, we
need to clarify that O_DIRECT reads need to know to look in the
pagecache. That is, if we decide that extending O_DIRECT writes without
fsync can return before the data hits the storage, the caller shouldn't
also have to call fsync() just to call read() of data they just wrote!
Yeah, I guess we can only do that if the filesystem guarantees
coherence between the page cache and O_DIRECT reads; it's been a long
while since I've studied that code, so I'm not sure whether all
filesystems that support O_DIRECT provide this coherency (since I
thought it was provided in the generic O_DIRECT routines, isn't it?)
or not.

- Ted
Christoph Hellwig
2009-08-24 02:42:33 UTC
Permalink
Post by Theodore Tso
Yeah, I guess we can only do that if the filesystem guarantees
coherence between the page cache and O_DIRECT reads; it's been a long
while since I've studied that code, so I'm not sure whether all
filesystems that support O_DIRECT provide this coherency (since I
thought it was provided in the generic O_DIRECT routines, isn't it?)
or not.
It's provided in the generic code, yes (or at least appears to).

Note that xfstests has quite a few tests exercising it.
Christoph Hellwig
2009-08-24 02:37:56 UTC
Permalink
Post by Joel Becker
In the section on perhaps not waiting for buffered fallback, we
need to clarify that O_DIRECT reads need to know to look in the
pagecache. That is, if we decide that extending O_DIRECT writes without
fsync can return before the data hits the storage, the caller shouldn't
also have to call fsync() just to call read() of data they just wrote!
The way the O_DIRECT fallback is implemented currently is that data does
hit the disk before return, thanks to a:

err = do_sync_mapping_range(file->f_mapping, pos, endbyte,
                            SYNC_FILE_RANGE_WAIT_BEFORE|
                            SYNC_FILE_RANGE_WRITE|
                            SYNC_FILE_RANGE_WAIT_AFTER);

which I expected to also sync the required metadata to disk, which
it doesn't. Which btw are really horrible semantics given that
we export that beast to userspace as a separate system call.
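
For reference, the userspace face of that beast is sync_file_range(2);
the same flag combination from userspace looks like the sketch below,
and as noted it commits neither data-finding metadata nor the drive
cache, so it is not a substitute for fdatasync():

#define _GNU_SOURCE
#include <fcntl.h>

/* Wait for any write-out already in flight over the range, start
 * write-out of the range, then wait for it to finish.  Metadata and
 * the disk write cache are left untouched. */
static int flush_page_cache_range(int fd, off_t pos, off_t len)
{
	return sync_file_range(fd, pos, len,
			       SYNC_FILE_RANGE_WAIT_BEFORE |
			       SYNC_FILE_RANGE_WRITE |
			       SYNC_FILE_RANGE_WAIT_AFTER);
}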

Dave Chinner
2009-08-26 06:34:55 UTC
Permalink
Post by Theodore Tso
Post by Christoph Hellwig
Post by Jamie Lokier
It turns out that applications needing integrity must use fdatasync or
O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
choose to use buffered writes at any time, with no signal to the
application.
The fallback was a relatively recent addition to the O_DIRECT semantics
for broken filesystems that can't handle holes very well. Fortunately
enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
semantics for that already.
Um, actually, we don't. If we did that, we would have to wait for a
journal commit to complete before allowing the write(2) to complete,
which would be especially painfully slow for ext3.
This question recently came up on the ext4 developer's list, because
of a question of how direct I/O to an preallocated (uninitialized)
extent should be handled. Are we supposed to guarantee synchronous
updates of the metadata by the time write(2) returns, or not? One of
the ext4 developers (I can't remember if it was Mingming or Eric)
asked an XFS developer what they did in that case, and I believe the
answer they were given was that XFS started a commit, but did *not*
wait for the commit to complete before returning from the Direct I/O
write. In fact, they were told (I believe this was from an SGI
engineer, but I don't remember the name; we can track that down if
it's important) that if an application wanted to guarantee metadata
would be updated for an extending write, they had to use fsync() or
O_SYNC/O_DSYNC.
That would have been Eric asking me. My answer was that O_DIRECT does
not imply any new data integrity guarantees associated with a
write(2) call - it just avoids system caches. You get the same
guarantees of resiliency as a non-O_DIRECT write(2) call at
completion - it may or may not be there if you crash. If you want
some guarantee of integrity, then you need to use O_DSYNC, O_SYNC or
call f[data]sync(2) just like all other IO.

Also, note that direct IO is not necessarily synchronous - you can
do asynchronous direct IO.....
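
A compressed sketch of that combination - Linux AIO on an O_DIRECT file
descriptor, with the cache flush still an explicit, separate step; it
assumes libaio is available (link with -laio) and omits all error
handling:

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	int fd = open("aio.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);

	posix_memalign(&buf, 4096, 4096);
	memset(buf, 0, 4096);
	io_setup(8, &ctx);

	/* Asynchronous direct write: io_submit() returns immediately. */
	io_prep_pwrite(&cb, fd, buf, 4096, 0);
	io_submit(ctx, 1, cbs);
	io_getevents(ctx, 1, 1, &ev, NULL);	/* wait for completion */

	/* Completion only means the device accepted the write; flushing
	 * any volatile cache is a separate, explicit step. */
	fdatasync(fd);

	io_destroy(ctx);
	free(buf);
	close(fd);
	return 0;
}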

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Jamie Lokier
2009-08-26 15:01:02 UTC
Permalink
Theodore Tso
2009-08-26 18:47:00 UTC
Permalink
Post by Jamie Lokier
1. If the automatic O_SYNC fallback mentioned by Christopher is
currently implemented at all, even in a subset of filesystems,
then I think it should be removed.
Could you clarify what you meant by "it" above? I'm not sure I
understood what you were referring to.

Also, it sounds like you and Dave are mostly agreeing with what
I've written here; is that true?

http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

I'm trying to get consensus that this is both (a) an accurate
description of the state of affairs in Linux, and (b) what
we think things should be, before I start circulating it around
application developers (especially database developers), to make sure
they have the same understanding of O_DIRECT semantics as we have.
Post by Jamie Lokier
4. On drives which need it, fdatasync/fsync must trigger a drive
cache flush even when there is no dirty page cache to write,
because dirty pages may have been written in the background
already, and because O_DIRECT writes dirty the drive cache but
not the page cache.
I agree we *should* do this, but we're going to take a pretty serious
performance hit when we do. Mac OS chickened out and added an
F_FULLFSYNC option:

http://developer.apple.com/documentation/Darwin/Reference/Manpages/man2/fcntl.2.html

The concern is that there are GUI programmers that want to update state
files after every window resize or move, and after every click in a web
browser. These GUI programmers then get cranky when changes get lost
after proprietary video drivers cause the laptop to lock up. If we
make fsync() too burdensome, then fewer and fewer applications will
use it. Evidently the MacOS developers decided that the number of
applications which really cared about device cache flushes was much
smaller than the vast number of applications that need a lightweight
file flush. Should we do the same?
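
For comparison, the Mac OS primitive is the F_FULLFSYNC fcntl; portable
code typically wraps it along these lines (the fsync() fallback is an
assumption about what such a wrapper would do on other systems):

#include <fcntl.h>
#include <unistd.h>

/* Push data through the drive's write cache where the platform offers
 * a knob for it; otherwise settle for a plain fsync(). */
static int full_sync(int fd)
{
#ifdef F_FULLFSYNC
	if (fcntl(fd, F_FULLFSYNC, 0) == 0)
		return 0;
	/* F_FULLFSYNC can fail on filesystems that don't support it;
	 * fall through to the ordinary sync. */
#endif
	return fsync(fd);
}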

It seems like an awful cop-out, but having seen, up close and
personal, how "aggressively stupid" some desktop programmers can be[1],
I can **certainly** understand why Apple chose the F_FULLFSYNC route.

[1] http://josefsipek.net/blahg/?p=364

- Ted
(who really needs to get himself
an O_PONIES t-shirt :-)
Jamie Lokier
2009-08-27 14:50:51 UTC
Permalink
Post by Theodore Tso
Post by Jamie Lokier
1. If the automatic O_SYNC fallback mentioned by Christopher is
currently implemented at all, even in a subset of filesystems,
then I think it should be removed.
Could you clarify what you meant by "it" above? I'm not sure I
understood what you were referring to.
I meant the automatic O_SYNC fallback, in other words, if O_DIRECT
falls back to buffered writing, Chris said it automatically did
O_SYNC, and you followed up by saying it doesn't :-)

All I'm saying is if there's _some_ code doing O_SYNC writing when
O_DIRECT falls back to buffered, it should be ripped out. Leave the
syncing to explicit fsync calls from userspace.
Post by Theodore Tso
Post by Jamie Lokier
4. On drives which need it, fdatasync/fsync must trigger a drive
cache flush even when there is no dirty page cache to write,
because dirty pages may have been written in the background
already, and because O_DIRECT writes dirty the drive cache but
not the page cache.
I agree we *should* do this, but we're going to take a pretty serious
performance hit when we do. Mac OS chickened out and added an
I know about that one. (I've done quite a lot of research on O_DIRECT
and fsync behaviours). It's really unfortunate that they didn't
provide F_FULLDATASYNC, which is what a database or VM would ideally
use.

I think Vxfs provides a whole suite of mount options to adjust what
O_SYNC and fdatasync actually do.
Post by Theodore Tso
The concern is that there are GUI programers that want to update state
files after every window resize or move, and after click on a web
browser. These GUI programmers then get cranky when changes get lost
after proprietary video drivers cause the laptop to lock up. If we
make fsync() too burdensome, then fewer and fewer applications will
use it. Evidently the MacOS developers decided the few applications
who really cared about doing device cache flushes were much smaller
than the fast number of applications that need a lightweight file
flush. Should we do the same?
If fsync is cheap but doesn't commit changes properly - what's the
point in encouraging applications to use it? Without drive cache
flushes, they will still lose changes occasionally.

(Btw, don't blame proprietary video drivers. I see too many lockups
with open source video drivers too.)
Post by Theodore Tso
It seems like an awful cop-out, but having seen, up front and
personal, how "agressively stupid" some desktop programmers can be[1],
I can **certainly** understand why Apple chose the F_FULLSYNC route.
I did see a few of those threads, and I think your solution was genius.
Genius at keeping people quiet that is :-)

But it's also a good default. fsync() isn't practical in shell
scripts or Makefiles, although that's really because "mv" lacks the
fsync option...

Personally I side with "want some kind of full-system asynchronous
transactionality please". (Possibly aka. O_PONIES :-)

-- Jamie
Christoph Hellwig
2009-08-21 14:20:08 UTC
Permalink
Post by Jens Axboe
I've talked to Chris about this in the past too, but I never got around
to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
without making too many changes, and we do have FUA support on most SATA
drives too. Basically just a check in the driver for whether the
if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
WRITE_FUA;
I know that FUA is used by that other OS, so I think we should be golden
on the hw support side.
Just doing FUA should be pretty easy; in fact, from my reading of the
code we already use FUA for barriers if supported, that is, we only drain
the queue, do a pre-flush for the barrier and then issue the actual
barrier write as FUA.

I can play around with getting rid of the pre-flush and doing
cache-flush-based emulation if FUA is not supported, if you're fine with that.
James Bottomley
2009-08-21 15:06:10 UTC
Permalink
Post by Christoph Hellwig
Post by Jens Axboe
I've talked to Chris about this in the past too, but I never got around
to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
without making too many changes, and we do have FUA support on most SATA
drives too. Basically just a check in the driver for whether the
if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
WRITE_FUA;
I know that FUA is used by that other OS, so I think we should be golden
on the hw support side.
Just doing FUA should be pretty easy, in fact from my reading of the
code we already use FUA for barriers if supported, that is only drain
the queue, do a pre-flush for a barrier and then issue the actual
barrier write as FUA.
I've never really understood why FUA is considered equivalent to a
barrier. Our barrier semantics are that all I/Os before the barrier
should be safely on disk after the barrier executes. The FUA semantics
are that *this write* should be safely on disk after it executes ... it
can still leave preceding writes in the cache. I can see that if you're
only interested in metadata that making every metadata write a FUA and
leaving the cache to sort out data writes does give FS image
consistency.

How does FUA give us linux barrier semantics?

James


Christoph Hellwig
2009-08-21 15:23:19 UTC
Permalink
Post by James Bottomley
I've never really understood why FUA is considered equivalent to a
barrier. Our barrier semantics are that all I/Os before the barrier
should be safely on disk after the barrier executes. The FUA semantics
are that *this write* should be safely on disk after it executes ... it
can still leave preceding writes in the cache. I can see that if you're
only interested in metadata that making every metadata write a FUA and
leaving the cache to sort out data writes does give FS image
consistency.
How does FUA give us linux barrier semantics?
FUA by itself doesn't.

Think of what use cases we have for barriers and/or FUA right now:

- a cache flush. Can only be implemented as a cache flush, obviously.
- a barrier flush bio - can be implemented as
  o cache flush, write, cache flush
  o or, more efficiently, as cache flush, write with the FUA bit set

Now there is a third use case for O_SYNC and O_DIRECT writes, which actually
do have FUA-like semantics, that is, we only guarantee the I/O is on disk,
but we do not make guarantees about ordering vs earlier writes.
Currently we (as in those few filesystems bothering despite the
VFS/generic helpers making it really hard) implement O_SYNC by:

- doing one or multiple normal writes, and waiting on them
- then issuing a cache flush - either explicitly via blkdev_issue_flush
  or implicitly as part of a barrier write for metadata

This could be done more efficiently by simply setting the FUA bit on these
requests if we had an API for it. The same should also apply to O_DIRECT,
except that currently we don't even try.
