Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs

From: Dave Chinner <hidden>
Date: 2011-01-24 04:29:59
Also in: linux-fsdevel

On Thu, Jan 20, 2011 at 01:44:57PM +0800, Shaohua Li wrote:

On Thu, 2011-01-20 at 12:41 +0800, Dave Chinner wrote:

On Wed, Jan 19, 2011 at 08:10:14PM -0800, Andrew Morton wrote:

On Thu, 20 Jan 2011 11:21:49 +0800 Shaohua Li [off-list ref] wrote:

It seems to return a single offset/length tuple which refers to the
btrfs metadata "file", with the intent that this tuple later be fed
into a btrfs-specific readahead ioctl.

I can see how this might be used with say fatfs or ext3 where all
metadata resides within the blockdev address_space.  But how is a
filesystem which keeps its metadata in multiple address_spaces supposed
to use this interface?

Oh, this looks like a big problem; thanks for letting me know about
such filesystems. Is it possible for a specific filesystem to map
multiple address_space ranges onto one big virtual range? The new
ioctls would handle the mapping.

I'm not sure what you mean by that.

ext2, minix and probably others create an address_space for each
directory.  Heaven knows what xfs does (for example).

In 2.6.39 it won't even use address spaces for metadata caching.

Besides, XFS already has pretty sophisticated metadata readahead
built in - it's one of the reasons why the XFS directory code scales
so well on cold cache lookups of large directories - so I don't see
much need for such an interface for XFS.

Perhaps btrfs would be better served by implementing speculative
metadata readahead in the places where it makes sense (e.g. readdir)
because it will improve cold-cache performance on a much wider range
of workloads than at just boot-time....

I don't know about XFS. Sophisticated metadata readahead might make
metadata reads async, but I don't think it can remove the disk
seek.

Nothing you do will remove the disk seek. What readahead is supposed
to do is _minimise the latency_ of the disk seek.

Since metadata and data usually live in different disk block
ranges, doing data readahead will unavoidably read metadata and cause
disk seeks between reading data and metadata.

Which comes back to how well the filesystem lays out the metadata
related to the data that needs to be read. In the case of XFS, the
metadata it needs is already in the inode, so once the inodes are
read into memory, there are no extra metadata seeks between data
seeks.

That is, if you are using XFS all you need to do in terms of
metadata readahead is stat every file needed by the boot process.
The optimal way to do this is simply to order them by
ascending inode number. IOWs, the problem can be optimised without
any special kernel interfaces to do metadata readahead, especially
if you multithread the stat() walk to avoid blocking on IO that
metadata readahead hasn't already brought into cache....
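
As a rough illustration of the stat() walk described above, here is a
minimal sketch. It is not code from this thread: the function names, the
default thread count and the choice of Python are all assumptions made
for the example. It collects paths with their inode numbers from the
directory entries, sorts them in ascending inode order, and then issues
the stat calls from a pool of threads so that one blocked call does not
stall the rest.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def collect_by_inode(root):
    """Return (inode, path) pairs found under root, sorted by inode number.

    Inode numbers come from the directory entries themselves, so this
    pass reads directories but does not need to stat each file.
    """
    entries = []
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            with os.scandir(directory) as it:
                for entry in it:
                    entries.append((entry.inode(), entry.path))
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError:
            # Unreadable directory: skip it rather than abort the walk.
            continue
    entries.sort()
    return entries


def _stat_quiet(path):
    """lstat one path to pull its inode into cache; report success."""
    try:
        os.lstat(path)
        return True
    except OSError:
        return False


def warm_metadata(root, workers=8):
    """Stat everything under root in ascending inode order.

    Several threads are used so that a stat blocked on IO does not hold
    up the others. Returns the number of successful stat calls.
    """
    paths = [path for _, path in collect_by_inode(root)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(_stat_quiet, paths))
```

Something like warm_metadata("/usr/lib") early in boot would be one way
to use it. Whether ascending inode order matches on-disk order depends on
the filesystem, so the benefit is an assumption to be measured, not a
guarantee.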

IIRC, btrfs tends to keep all its per-inode metadata close together
like XFS does, so it should be read at the same time the inode is
read.

Indeed, the dependencies of readahead are pretty well understood.
Optimising the reading of file data across a complex directory
hierarchy is well demonstrated by this little tool from
Chris Mason:

http://oss.oracle.com/~mason/acp/

I suspect that applying such a technique to the problem of optimising
boot-time IO patterns will net you the same gains as this new kernel
API will. And it will do it in a manner that is filesystem
agnostic...

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
