Re: [PATCH 0/6] Extended file stat system call

From: Dave Chinner <hidden>
Date: 2012-04-28 00:38:33
Also in: linux-api, linux-cifs, linux-fsdevel, linux-nfs


On Thu, Apr 26, 2012 at 09:22:04PM -0600, Andreas Dilger wrote:

On 2012-04-26, at 7:06 PM, Dave Chinner wrote:

On Thu, Apr 19, 2012 at 03:05:58PM +0100, David Howells wrote:

Implement a pair of new system calls to provide extended and further extensible stat functions.

The second of the associated patches is the main patch that provides these new system calls:

	ssize_t ret = xstat(int dfd,
			    const char *filename,
			    unsigned atflag,
			    unsigned mask,
			    struct xstat *buffer);

	ssize_t ret = fxstat(int fd,
			     unsigned atflag,
			     unsigned mask,
			     struct xstat *buffer);

which are more fully documented in the first patch's description.

These new stat functions provide a number of useful features, in summary:

(1) More information: creation time, inode generation number, data
    version number, flags/attributes.  A subset of these is available
    through a number of filesystems (CIFS, NFS, AFS, Ext4 and BTRFS).

If we are adding per-inode flags, then what do we do with filesystem
specific flags? e.g. XFS has quite a number of per-inode flags that
don't align with any other filesystem (e.g. filestream allocator,
real time file, behaviour inheritance flags, etc), but may be useful
to retrieve in such a call. We currently have an ioctl to get that
information from each inode. Have you thought about how to handle
such flags?
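
For readers who have not met the ioctl mentioned above, here is a minimal sketch of fetching the per-inode flag word. It assumes kernel headers recent enough to export the generic `FS_IOC_FSGETXATTR` and `struct fsxattr` from `<linux/fs.h>`; at the time of this thread the same call was, as far as I know, only available under the XFS-specific name. The helper name `get_inode_xflags` is made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FSGETXATTR (assumes recent headers) */

/*
 * Hypothetical helper: fetch the extended per-inode flags for path.
 * Returns 0 and stores the flag word in *xflags on success, or -1 if the
 * file cannot be opened or the filesystem does not implement the ioctl.
 */
int get_inode_xflags(const char *path, uint32_t *xflags)
{
	struct fsxattr fsx;
	int fd, ret;

	assert(path != NULL && xflags != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
	close(fd);
	if (ret < 0)
		return -1;

	*xflags = fsx.fsx_xflags;
	return 0;
}
```

A caller would then test individual bits, for example `flags & FS_XFLAG_REALTIME`. Every application that wants this information has to know about the ioctl and cope with filesystems that reject it, which is the cost being weighed against putting flags in the stat call.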

I'm sympathetic to your cause, but I don't want this to degrade into
the same morass that it did last time when every attribute under the
sun was added to the call.

Understood, which is why I'm not asking for everything under the sun
to be supported. I'm more interested in finding the useful subset of
information that a typical application might make use of.

The intent is to replace the stat() call
with something that can avoid overhead on filesystems for which some
attributes are expensive, and that applications may not need.  Some
common attributes were added that are used by multiple filesystems.

If it is too filesystem-specific, and there is little possibility
that these attributes will be usable on other filesystems, then it
should remain a filesystem specific ioctl() call.

Right, that's why I didn't mention the real-time bits, the
filestream allocation bits, or other things that are tightly bound
to the way XFS works....

If you can make
a case that these attributes have value on a few other filesystems,
and applications are reasonably likely to be able to use them, and
their addition does not make the API overly complex, then suggest
away.

Exactly my thoughts ;)

Along the same lines, filesystems can have allocation constraints
for IO that differ from the filesystem block size - ext4 with its
bigalloc hack, XFS with its per-inode extent size hints and the
realtime device, etc. Then there are optimal IO characteristics
(e.g. geometry hints like stripe unit/stripe width for the
allocation policy of that given file) that applications could use
if they were present rather than having to expose them through
ioctls that nobody even knows about...

There is already "optimal IO size" that the application can use,
how do the geometry hints differ?

Have a look at how XFS overloads stat.st_blksize depending on the
filesystem and inode config. It's amazingly convoluted, and based on
a combination of filesystem geometry, inode bits and mount options:

xfs_vn_getattr()
....
                if (XFS_IS_REALTIME_INODE(ip)) {
                        /*
                         * If the file blocks are being allocated from a
                         * realtime volume, then return the inode's realtime
                         * extent size or the realtime volume's extent size.
                         */
                        stat->blksize =
                                xfs_get_extsz_hint(ip) << mp->m_sb.sb_blocklog;
                } else
                        stat->blksize = xfs_preferred_iosize(mp);
......

xfs_extlen_t
xfs_get_extsz_hint(
        struct xfs_inode        *ip)
{
        if ((ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE) && ip->i_d.di_extsize)
                return ip->i_d.di_extsize;
        if (XFS_IS_REALTIME_INODE(ip))
                return ip->i_mount->m_sb.sb_rextsize;
        return 0;
}

....

static inline unsigned long
xfs_preferred_iosize(xfs_mount_t *mp)
{
        if (mp->m_flags & XFS_MOUNT_COMPAT_IOSIZE)
                return PAGE_CACHE_SIZE;
        return (mp->m_swidth ?
                (mp->m_swidth << mp->m_sb.sb_blocklog) :
                ((mp->m_flags & XFS_MOUNT_DFLT_IOSIZE) ?
                        (1 << (int)MAX(mp->m_readio_log, mp->m_writeio_log)) :
                        PAGE_CACHE_SIZE));
}

All of that can be exported as 4 parameters for normal files:

	allocation block size      (extent size hint)
	minimum IO size            (PAGE_CACHE_SIZE)
	preferred minimum IO size  (mp->m_readio_log/mp->m_writeio_log)
	best aligned IO size       (stripe width)

And for realtime files it's a bit different because of the
block-based bitmap allocator it uses:

	allocation block size      (extent size hint)
	minimum IO size            (PAGE_CACHE_SIZE)
	preferred minimum IO size  (extent size hint)
	best aligned IO size       (some multiple of extent size hint)

Userspace is able to handle
st_blksize values of several MB without problems, and any sane
application will size and align its IO to multiples of this.

Actually, some applications still have problems with that. That's
the reason we only expose stripe widths in st_blksize when a mount
option is set. Stripe widths are known to get into the tens of MB,
and applications using st_blksize for memory allocation of IO
buffers tend to get into trouble with those.

That's why I'd prefer specific optimal IO hints - we don't have to
overload st_blksize with lots of meanings to pass what is relatively
trivial information back to the application.
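
To illustrate the st_blksize problem described above, here is a minimal sketch of an application that sizes its IO buffer from `stat()` but bounds the result. The 4096-byte fallback, the caller-supplied cap and both function names are assumptions for illustration, not something proposed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

#define FALLBACK_BUF_SIZE 4096u	/* assumed default, not from the thread */

/* Bound a reported st_blksize so it is safe to use as a buffer size. */
size_t clamp_io_buffer_size(long blksize, size_t cap)
{
	assert(cap >= FALLBACK_BUF_SIZE);

	if (blksize <= 0)
		return FALLBACK_BUF_SIZE;
	if ((unsigned long)blksize > cap)
		return cap;
	return (size_t)blksize;
}

/* Pick a buffer size for path; uses the fallback if stat() fails. */
size_t io_buffer_size(const char *path, size_t cap)
{
	struct stat st;

	assert(path != NULL);

	if (stat(path, &st) != 0)
		return clamp_io_buffer_size(0, cap);
	return clamp_io_buffer_size((long)st.st_blksize, cap);
}
```

An application that skips the clamp and passes st_blksize straight to its allocator is the case that breaks when a stripe width of tens of MB is reported. The clamp keeps memory use bounded, but it also throws away the alignment information, which is one more argument for separate hints.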

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
