Re: Kernel 3.0.0 + ext4 + ceph == ...
From: Christian Brunner <hidden>
Date: 2011-08-18 09:19:42
Also in: ceph-devel
I'm sorry that I have to correct this: the problem is still happening with 3.0.1, although it now seems to happen only under high load.

I also did some tracing (with 3.0.0, where the problem is easier to reproduce). It may be interesting to note that the corruption does not occur when I run "strace -f cosd". (Maybe a race condition?)

To reproduce the problem I have now set up a ceph cluster on a single machine with replication between /ceph/osd.000 and /ceph/osd.001. My setup now has only two active placement groups with 2 objects. The corruption happens when I start replication from osd.000 to osd.001.

It is reproducible most of the time (but not always) when I do the following:

# mkfs.ext4 -T largefile /dev/sdb1
# mount -o noatime,user_xattr /dev/sdb1 /ceph/osd.001/
# cosd -i 001 --mkjournal --mkfs --monmap /tmp/monmap
# /usr/bin/cosd -d -i 001 -c /etc/ceph/ceph.conf
### wait until replication has finished and then stop the cosd
# umount /dev/sdb1
# fsck.ext4 -f /dev/sdb
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Inode 43, i_blocks is 8, should be 16. Fix<y>? no
Inode 2078, i_blocks is 24, should be 16. Fix<y>? no

I can also provide an e2image with the metadata and the strace output of the cosd, if this would be helpful.

Regards,
Christian

2011/8/8 Christian Brunner [off-list ref]:
I tried 3.0.1 today, which contains the commit Theodore suggested, and was no longer able to reproduce the problem. So I think the corruption we have seen is indeed related to:

commit 7132de744ba76930d13033061018ddd7e3e8cd91
Author: Maxim Patlasov [off-list ref]
Date:   Sun Jul 10 19:37:48 2011 -0400

    ext4: fix i_blocks/quota accounting when extent insertion fails

I will now try to apply this patch to the RHEL6.1 kernel and see what happens...

Thanks for your help.

Christian

2011/8/3 Yehuda Sadeh Weinraub [off-list ref]:
On Wed, Aug 3, 2011 at 7:16 AM, Christian Brunner [off-list ref] wrote:
...
I tried to reproduce this without ceph, but wasn't able to... In the meantime it seems that I can also see the side effects on the librbd side: I get a "librbd: data error!" when I do an "rbd copy". When I look at the librbd code, this is related to a sparse_read not returning the right size of the object. I don't know if it helps, but I think that the problem is also related to sparse file usage.

There were a few sparse-read issues that we fixed not too long ago, but they should have been fixed for at least the previous ceph version. I'm not sure what version you're using. There was an ext4 fiemap issue that I was hitting in specific environments, but I couldn't determine whether it was fixed in later kernel versions (I was using 2.6.32). Now is a good time to try and get to the bottom of it.

Here's a script I was using to reproduce it:

#!/bin/sh
dd if=/dev/urandom of=bla bs=1 seek=$((0x6f000)) count=$((0x1000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x70000)) count=$((0x1000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x71000)) count=$((0x1000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x72000)) count=$((0x1000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x73000)) count=$((0x1000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x74000)) count=$((0x2000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x2ae000)) count=$((0x2000)); sync

You can compile and run the following utility to dump all the extents:

http://pastebin.com/h2Cnpk2Q

Thanks,
Yehuda

Oh, btw, you can effectively disable the use of fiemap by setting the 'filestore fiemap threshold' config option to a large enough value (e.g., anything bigger than 4 MB should be enough for rbd).

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
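A note on Yehuda's dd script: the logical data ranges it should leave in "bla" can be worked out from the seek/count values alone, which gives something to compare an extent dump against. The following is only a sketch. The names "writes" and "expected_ranges" are made up for illustration, and it assumes that each dd call leaves the data written by the earlier calls intact (the offsets are ascending) and that bs=1 makes seek and count plain byte values.

```shell
#!/bin/sh
# Offsets and lengths (offset:length) copied from the reproduction script
# in the thread. "writes" is a hypothetical name used only in this sketch.
writes="0x6f000:0x1000 0x70000:0x1000 0x71000:0x1000 0x72000:0x1000
0x73000:0x1000 0x74000:0x2000 0x2ae000:0x2000"

# Merge writes that touch or overlap and print one "start-end" range per
# line in hex (end is exclusive). Assumes the list is sorted by offset.
expected_ranges() {
    start=
    end=
    for w in $writes; do
        off=$((${w%%:*}))
        len=$((${w##*:}))
        if [ -z "$start" ]; then
            start=$off
            end=$((off + len))
        elif [ "$off" -le "$end" ]; then
            # This write begins inside or right after the current range.
            if [ $((off + len)) -gt "$end" ]; then
                end=$((off + len))
            fi
        else
            printf '0x%x-0x%x\n' "$start" "$end"
            start=$off
            end=$((off + len))
        fi
    done
    if [ -n "$start" ]; then
        printf '0x%x-0x%x\n' "$start" "$end"
    fi
}

expected_ranges
```

Under those assumptions the data should sit in two logical ranges, 0x6f000-0x76000 and 0x2ae000-0x2b0000, with holes everywhere else. The filesystem is free to split a logical range into several physical extents, so an extent dump (for instance from "filefrag -v bla") may list more than two entries; what should match is the logical coverage.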
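A note on the e2fsck output in the newest message of this thread: the two i_blocks mismatches can be converted to bytes to see how far the accounting is off. This is a rough sketch, not part of anyone's tooling. The names "fsck_output" and "iblocks_mismatches" are hypothetical, and it assumes i_blocks is counted in 512-byte units (which would not hold for inodes using the huge_file representation).

```shell
#!/bin/sh
# Sample lines copied from the e2fsck output quoted in the thread.
fsck_output='Inode 43, i_blocks is 8, should be 16. Fix<y>? no
Inode 2078, i_blocks is 24, should be 16. Fix<y>? no'

# Read e2fsck output on stdin and print, for each i_blocks mismatch:
#   inode  recorded  expected  (recorded - expected) in bytes
# Assumption: i_blocks is in 512-byte units.
iblocks_mismatches() {
    sed -n 's/^Inode \([0-9]*\), i_blocks is \([0-9]*\), should be \([0-9]*\)\..*/\1 \2 \3/p' |
    while read -r inode have want; do
        echo "$inode $have $want $(( (have - want) * 512 ))"
    done
}

printf '%s\n' "$fsck_output" | iblocks_mismatches
```

With the sample lines, inode 43 comes out 4096 bytes under and inode 2078 comes out 4096 bytes over. If the filesystem uses 4 KiB blocks (an assumption, the thread does not state the block size), that is one block in each direction.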