Re: [bug] ext{3,4}: __find_get_block_slow() failed on 3.0.3

From: Jan Kara <jack@suse.cz>
Date: 2011-09-05 12:59:50
Also in: lkml

  Hi,

On Sat 20-08-11 01:51:49, Thilo-Alexander Ginkel wrote:

while rsyncing a large amount (> 1TB) of data from an ext3 to an ext4
on my machine [1], I encountered an issue where rsync and syslog
eventually started consuming 100% CPU and my syslog was flooded [2]
with error messages:

-- 8< --

kernel: [101543.047293] b_state=0x00000029, b_size=>[10ock01543.04>[101543.047321] __find_get_block_slow() failed. block=328204473, b_blocknr=51867812025
kernel: [101543.047330] b_state=0x00000029, b_size=4096
kernel: [101543.047>[10ock01543.047348] b_state=0x00000029, b_size=4096
kernel: [101543.047353] device blocksize: 4096
kernel: [101543.047359] __find_get_block_slow() failed. block=328204473, b01543.0>[10ock01543.047>[1ock01543.047404] b_state=0x00000029, b_size=4096
kernel: [101543.047409] device blocksize: 4096
kernel: [101543.047414] __find_get_block_slow() failed. block=328204473, b_blocknr=51867812025
kernel: [10154ock01543.0>[1ock01543.0492>[1ock01543.0492>[1ock01543.049>[1ock01543.0492>[1ock01543.0>[1ock01543.049>[1ock01543.049>[1ock01543.0492>[10ock01543.0>[1ock=01543.04>[1ock01543.>[1ock01543.0493>[1ock01543.049>[1ock01543.04>[1ock01543.0493>[1ock01543.04941>[1ock01543.0494>[1ock01543.0>[1ock01543.049>[10ock01543.0>[1ock01543.04>[1ock01543.04>[1ock01543.0495>[1ock01543.0495>[1ock01543.0495>[1ock01543.0496>[1ock01543.04>[1ock01543.04>[1ock01543.049>[1ock01543.049>[1ock01543.04>[1ock01543.0497>[1ock01543.0>[1ock01543.0497>[1ock01543.0497>[1ock01543.0498>[1ock01543.0498>[1ock01543.04>[1ock01543.04>[1ock01543.0498>[1ock01543.0498>[1ock01543.0499>[1ock01543.0499>[1ock01543.04>[101543.049967] __find_get_block_slow() failed. block=328204473, b_blocknr=51867812025
kernel: [101543.049975] b_state=0x00000029, b_size=4096
kernel: [101543.049980] device blocksize: 4096
kernel: [101543.049986] __find_get_block_slow() failed. block=328204473, b_blocknr=51867812025

-- 8< --

These are not preceded by any other error messages (about possible FS
inconsistencies) as has been the case in the past when bugs related to
this error message were reported.

Judging by the block size, the possibly corrupt volume is the ext3 one
(the ext4 volume has a block size of 2048).

A forced fsck.ext{3,4} of the source and target partitions did not
show any inconsistencies.

Any ideas?

  Something has corrupted your buffer head structure in memory (and we then
looped infinitely in __getblk_slow()). bh->b_blocknr was 0xC139000B9
whereas it should have been 0x139000B9 (the 5th byte was changed from 0x00
to 0x0C). It might be a hw fault, a buggy driver, or some other bug - hard to
say. You might want to run memtest for some time, or enable some kernel debug
options (DEBUG_PAGEALLOC, DEBUG_SLAB) which might catch the code causing the
corruption (this assumes it's at least occasionally reproducible and you
are willing to take the performance hit)...
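
For anyone wanting to check the hex arithmetic, here is a minimal sketch in
Python. It only uses the two decimal values printed in the log (block= and
b_blocknr=); the variable names and the helper differing_bytes() are made up
for illustration and are not kernel identifiers.

```python
# Values copied from the quoted log lines.
expected = 328204473      # "block=" (the block that was asked for)
observed = 51867812025    # "b_blocknr=" (what the buffer head claimed)

# XOR leaves only the bits that differ between the two values.
diff = expected ^ observed


def differing_bytes(x):
    """Return (byte_index, byte_value) pairs for the non-zero bytes of x.

    Byte index 0 is the least significant byte, so index 4 is the 5th byte.
    """
    result = []
    index = 0
    while x:
        if x & 0xFF:
            result.append((index, x & 0xFF))
        x >>= 8
        index += 1
    return result


print(hex(expected))          # 0x139000b9
print(hex(observed))          # 0xc139000b9
print(hex(diff))              # 0xc00000000
print(differing_bytes(diff))  # [(4, 12)], i.e. byte 4 holds 0x0c
```

Only one byte differs, which fits in-memory corruption of a single field
better than an on-disk inconsistency, although it does not say what caused it.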

								Honza
-- 
Jan Kara [off-list ref]
SUSE Labs, CR
