Thread (91 messages), 3 authors, 2014-08-02

Re: [PATCH 16/37] libext2fs: support allocating uninit blocks in bmap2()

From: Darrick J. Wong <hidden>
Date: 2014-05-06 19:59:46

On Tue, May 06, 2014 at 05:45:01PM +0200, Lukáš Czerner wrote:
On Thu, 1 May 2014, Darrick J. Wong wrote:
Date: Thu, 01 May 2014 16:14:07 -0700
From: Darrick J. Wong <redacted>
To: tytso@mit.edu, darrick.wong@oracle.com
Cc: linux-ext4@vger.kernel.org
Subject: [PATCH 16/37] libext2fs: support allocating uninit blocks in bmap2()

In order to support fallocate, we need to be able to have
ext2fs_bmap2() allocate blocks and put them into uninitialized
extents.  There's a flag to do this in the extent code, but it's not
exposed to the bmap2 interface, so plumb that in.  Eventually fuse2fs
or somebody will use it.

Signed-off-by: Darrick J. Wong <redacted>
---
 lib/ext2fs/bmap.c      |   24 ++++++++++++++++++++++--
 lib/ext2fs/ext2fs.h    |    1 +
 lib/ext2fs/mkjournal.c |   17 +++++++++++++++++
 3 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/lib/ext2fs/bmap.c b/lib/ext2fs/bmap.c
index c1d0e6f..a4dc8ef 100644
--- a/lib/ext2fs/bmap.c
+++ b/lib/ext2fs/bmap.c
@@ -72,6 +72,11 @@ static _BMAP_INLINE_ errcode_t block_ind_bmap(ext2_filsys fs, int flags,
 					    block_buf + fs->blocksize, &b);
 		if (retval)
 			return retval;
+		if (flags & BMAP_UNINIT) {
+			retval = ext2fs_zero_blocks2(fs, b, 1, NULL, NULL);
+			if (retval)
+				return retval;
+		}
 
 #ifdef WORDS_BIGENDIAN
 		((blk_t *) block_buf)[nr] = ext2fs_swab32(b);
@@ -214,10 +219,13 @@ static errcode_t extent_bmap(ext2_filsys fs, ext2_ino_t ino,
 	errcode_t		retval = 0;
 	blk64_t			blk64 = 0;
 	int			alloc = 0;
+	int			set_flags;
+
+	set_flags = bmap_flags & BMAP_UNINIT ? EXT2_EXTENT_SET_BMAP_UNINIT : 0;
 
 	if (bmap_flags & BMAP_SET) {
 		retval = ext2fs_extent_set_bmap(handle, block,
-						*phys_blk, 0);
+						*phys_blk, set_flags);
 		return retval;
 	}
 	retval = ext2fs_extent_goto(handle, block);
@@ -254,7 +262,7 @@ got_block:
 		alloc++;
 	set_extent:
 		retval = ext2fs_extent_set_bmap(handle, block,
-						blk64, 0);
+						blk64, set_flags);
 		if (retval) {
 			ext2fs_block_alloc_stats2(fs, blk64, -1);
 			return retval;
@@ -345,6 +353,12 @@ errcode_t ext2fs_bmap2(ext2_filsys fs, ext2_ino_t ino, struct ext2_inode *inode,
 		goto done;
 	}
 
+	if ((bmap_flags & BMAP_SET) && (bmap_flags & BMAP_UNINIT)) {
+		retval = ext2fs_zero_blocks2(fs, *phys_blk, 1, NULL, NULL);
+		if (retval)
+			goto done;
+	}
+
 	if (block < EXT2_NDIR_BLOCKS) {
 		if (bmap_flags & BMAP_SET) {
 			b = *phys_blk;
@@ -360,6 +374,12 @@ errcode_t ext2fs_bmap2(ext2_filsys fs, ext2_ino_t ino, struct ext2_inode *inode,
 			retval = ext2fs_alloc_block(fs, b, block_buf, &b);
 			if (retval)
 				goto done;
+			if (bmap_flags & BMAP_UNINIT) {
+				retval = ext2fs_zero_blocks2(fs, b, 1, NULL,
+							     NULL);
+				if (retval)
+					goto done;
+			}
 			inode_bmap(inode, block) = b;
 			blocks_alloc++;
 			*phys_blk = b;
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 599c972..819a14a 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -527,6 +527,7 @@ typedef struct ext2_icount *ext2_icount_t;
  */
 #define BMAP_ALLOC	0x0001
 #define BMAP_SET	0x0002
+#define BMAP_UNINIT	0x0004
 
 /*
  * Returned flags from ext2fs_bmap
diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
index 884d9c0..ecc3912 100644
--- a/lib/ext2fs/mkjournal.c
+++ b/lib/ext2fs/mkjournal.c
@@ -174,6 +174,23 @@ errcode_t ext2fs_zero_blocks2(ext2_filsys fs, blk64_t blk, int num,
 			return ENOMEM;
 		memset(buf, 0, fs->blocksize * STRIDE_LENGTH);
 	}
+
+	/* Try discard, if it zeroes data... */
+	if (io_channel_discard_zeroes_data(fs->io)) {
+		memset(buf + fs->blocksize, 0, fs->blocksize);
+		retval = io_channel_discard(fs->io, blk, num);
+		if (retval)
+			goto skip_discard;
+		retval = io_channel_read_blk64(fs->io, blk, 1, buf);
+		if (retval)
+			goto skip_discard;
+		if (memcmp(buf, buf + fs->blocksize, fs->blocksize) == 0)
+			return 0;
+		/* Hah!  Discard doesn't zero! */
+		fs->io->flags &= ~CHANNEL_FLAGS_DISCARD_ZEROES;
+	}
+skip_discard:

You did not mention that in the description, but this is actually a
problem. The reason is that discard might not be reliable on some
devices. This has been discussed several times, and I am not the only
one who has seen that, even if the device itself says that it will
return zeroes from discarded regions, sometimes it might return data.

I agree that the storage not living up to the interface it advertises is a
problem, hence the verification step that will unset the io channel flag if it
finds that the device is lying.

On the other hand, I wonder if this ought to be abstracted away in an
io_channel_zero() call that takes care of figuring out if it can do a zeroing
discard or if it has to write a block of zeroes.
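
To make the "trust, then verify, then fall back" idea concrete, here is a rough sketch of what such a zeroing helper might look like. It is only an illustration against a mock in-memory device: struct mock_dev, mock_discard(), block_is_zero() and mock_zero_blocks() are all hypothetical names, and none of this is the libext2fs io_channel API. Unlike the hunk above, the sketch checks every discarded block rather than only the first one, and it does nothing about a device that reads back zeroes now and stale data later.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 16	/* tiny block size, for illustration only */
#define NUM_BLOCKS 8

/* Hypothetical in-memory "device"; NOT a libext2fs io_channel. */
struct mock_dev {
	unsigned char data[NUM_BLOCKS][BLOCK_SIZE];
	int claims_discard_zeroes;	/* what the device advertises */
	int really_zeroes;		/* what its discard actually does */
	int writes;			/* blocks zeroed by explicit writes */
};

/* Return 1 if every byte of one block is zero, else 0. */
static int block_is_zero(const unsigned char *p)
{
	size_t i;

	for (i = 0; i < BLOCK_SIZE; i++)
		if (p[i])
			return 0;
	return 1;
}

/* Mock discard: always reports success, but only an honest device zeroes. */
static int mock_discard(struct mock_dev *d, unsigned blk, unsigned num)
{
	unsigned i;

	if (d->really_zeroes)
		for (i = 0; i < num; i++)
			memset(d->data[blk + i], 0, BLOCK_SIZE);
	return 0;
}

/*
 * Hypothetical zeroing helper: try a zeroing discard if advertised,
 * verify the result, and fall back to writing zeroes otherwise.
 */
static int mock_zero_blocks(struct mock_dev *d, unsigned blk, unsigned num)
{
	unsigned i;
	int ok;

	assert(d != NULL);
	if (blk >= NUM_BLOCKS || num > NUM_BLOCKS - blk)
		return -1;

	if (d->claims_discard_zeroes) {
		ok = (mock_discard(d, blk, num) == 0);
		/* Re-read every block we discarded, not just the first. */
		for (i = 0; ok && i < num; i++)
			ok = block_is_zero(d->data[blk + i]);
		if (ok)
			return 0;
		/* The device lied (or discard failed): stop trusting it. */
		d->claims_discard_zeroes = 0;
	}

	/* Fallback: explicitly write zeroes, one block at a time. */
	for (i = 0; i < num; i++) {
		memset(d->data[blk + i], 0, BLOCK_SIZE);
		d->writes++;
	}
	return 0;
}
```

With a device that sets claims_discard_zeroes but leaves really_zeroes clear, the first call clears the flag and zeroes the range by writing, so later calls skip the discard attempt entirely. A caller then only needs the one entry point and never has to look at the flag itself.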

Or, are you worried that a discard and immediate re-read will appear to work,
but that a later re-read will return non-zero data?

I would rather avoid this kind of optimization. However, if the
underlying "device" is a loop device, then it will be reliable if
it's supported. Also, if the underlying "device" is an image, then we
can simply use punch hole.

But static whitelisting is also problematic -- what if the storage device is an
AHCI (or virtio-scsi) disk in QEMU that's ultimately backed by a file that we
can punch_hole?  How do we distinguish that from an SSD hooked up to SATA
hardware?

In the QEMU-emulated AHCI case we ought to be able to do a zeroing discard, if
advertised.  I thought it was a reasonable compromise to trust that it works
and verify the results afterward.

--D
Thanks!
-Lukas
+
 	/* OK, do the write loop */
 	j=0;
 	while (j < num) {

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html