Direct IO and Page cache
From: Chinmay V S <hidden>
Date: 2013-07-26 10:31:34
On Fri, Jul 26, 2013 at 6:21 PM, Chinmay V S [off-list ref] wrote:
> On Fri, Jul 26, 2013 at 12:02 PM, Kumar amit mehta [off-list ref] wrote:
>> On Fri, Jul 26, 2013 at 05:14:21PM +0800, Chinmay V S wrote:
>>>> We have direct I/O (O_DIRECT), for example raw devices (/dev/rawctl) that map to the block devices, and we also have the page cache. Now, if I've understood this correctly, direct I/O will bypass the page cache, which is fine. I'll not get into the performance debate, but what about data consistency? The kernel cannot and __shouldn't__ try to control how applications are written. So one bad day somebody comes up with an application that does both types of I/O (one that goes through the page cache and one that doesn't). In that application, one instance writes directly to the backend device, and the other instance, which is not aware of this write, goes ahead and writes to the page cache, and that write is written to the backend device later. So wouldn't we end up corrupting the on-disk data?
>>>
>>> Yes. And that is the responsibility of the application. While the existence of O_DIRECT may not be common knowledge, anyone who knows about it *must* know that it bypasses the kernel page-cache and hence *must* know the consequences of doing cached and direct I/O on the same file simultaneously.
>>>> I can think of multiple other scenarios which could corrupt the on-disk data if there aren't any safeguarding policies employed by the kernel. But I'm very sure that the kernel is aware of such nasty attempts, and I'd like to know how the kernel takes care of this.
>>>
>>> O_DIRECT is an explicit flag that is not enabled by default. It is the app's responsibility to ensure that it does NOT misuse the feature. Essentially, specifying the O_DIRECT flag is the app's way of saying: "Hey kernel, I know what I am doing. Please step aside and let me talk to the hardware directly. Please do NOT interfere." The kernel happily obliges. Later, the app should NOT go crying back to the kernel (and blaming it) if the app manages to screw up the direct "relationship" with the hardware.
>>
>> So leaving the hardware at the mercy of the application doesn't sound like good practice. This __may__ compromise kernel stability too. Also think of this:
>>
>> In app1:
>>     fdx = open("blah", O_RDWR|O_DIRECT);
>>     write(fdx, buf, sizeof(buf));
>>
>> In app2 (unaware of app1):
>>     fdy = open("blah", O_RDWR);
>>     write(fdy, buf, sizeof(buf));
>>
>> I think this isn't highly unlikely, and if you agree with me then we may end up with the same could-be/would-be data corruption. Now who should be blamed here: app1, app2 or the kernel? Or will it be handled differently here?
>
> As long as app1 and app2 are managing separate files (even on the same underlying storage media), the situation looks good.
>
> From an app developer's perspective:
> If both apps do I/O on the same file, that implies knowledge of the other app. (Otherwise, how would the second app know that the file exists at such and such a location?) Hence the second app really ought to think about what it is going to do.
> case1: app1 uses regular I/O ==> app2 should NOT use direct I/O.
> case2: app1 uses direct I/O ==> app2 should NOT use regular I/O.
>
> From a kernel developer's perspective:
> The kernel driver guarantees coherency between the page-cache and data transferred using O_DIRECT.
> Refer to page 15 of this deck [1], which talks about the design of O_DIRECT.
>
> In either case, the bigger problem is that both apps need to work out a mutex mechanism to prevent the handful of readers-writers problems [2] that arise when both try to read/write the same file simultaneously. So it is more important (in fact, downright necessary) to ensure mutual exclusion between the 2 apps during I/O. Otherwise one of them will end up overwriting the changes made by the other, unless both apps are doing ONLY read()s.
>
> [1] http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-o_direct.html
> [2] http://en.wikipedia.org/wiki/Readers-writers_problem
>
> regards
> ChinmayVS
TL;DR

1. Do not worry about coherency between the page-cache and the data transferred using O_DIRECT. The kernel will invalidate the cache after an O_DIRECT write and flush the cache before an O_DIRECT read.
2. Use mutexes or semaphores (or any of the numerous options [1]) to prevent the usual synchronisation problems during IPC using a shared file.

[1] http://beej.us/guide/bgipc/output/html/singlepage/bgipc.html

regards
ChinmayVS
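
A minimal sketch of the mutual-exclusion point, assuming both apps agree to take an advisory flock() lock on the shared file before writing. The helper name locked_append and the choice of flock() (rather than fcntl() locks, a semaphore or some other IPC primitive) are illustrative assumptions and do not come from this thread. Advisory locks only help if every writer takes them; an app that skips the lock is not stopped by the kernel.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper: append one record to a shared file while holding
 * an exclusive advisory lock, so cooperating writers never interleave.
 * Returns 0 on success, -1 on any error (including a short write).
 */
int locked_append(const char *path, const char *record)
{
    assert(path != NULL && record != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    /* Block until no other cooperating process holds the lock. */
    if (flock(fd, LOCK_EX) < 0) {
        close(fd);
        return -1;
    }

    size_t len = strlen(record);
    ssize_t written = write(fd, record, len);
    int rc = (written == (ssize_t)len) ? 0 : -1;

    /* Push the data to the device before letting the next writer in. */
    if (fsync(fd) < 0)
        rc = -1;

    flock(fd, LOCK_UN);
    close(fd);
    return rc;
}
```

Each app would call something like locked_append("blah", "some record\n") instead of a bare write(). The same locking pattern applies whether a given app opens the file with or without O_DIRECT, since the lock is tied to the file and not to the I/O path.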