Re: Enable the skip_copy feature will results in data integrity issue in raid5 degraded mode.
From: Shaohua Li <shli@kernel.org>
Date: 2017-02-15 00:36:23
On Tue, Feb 14, 2017 at 11:48:51AM -0800, Shaohua Li wrote:
> On Mon, Feb 13, 2017 at 05:07:45PM +0800, Chien Lee wrote:
> > Hello,
> >
> > We recently found a bug in the skip_copy feature in raid5 degraded mode. We originally enabled skip_copy to speed up the system's write performance. But when the system has continuous database read/write I/O in raid5 degraded mode, Mongo DB detects checksum errors and generates the related debug log. The testing details follow.
> >
> > a. Enable skip_copy --> Checksum error logs from Mongo DB
> >
> > 2017-02-06T11:54:56.537+0800 E STORAGE [conn7] WiredTiger (0) [1486353296:537114][52:0x7f98396a4700], file:collection-110-3235234017846331078.wt, WT_CURSOR.next: read checksum error for 4096B block at offset 61440: calculated block checksum of 1363526237 doesn't match expected checksum of 2969711960
> >
> > b. Disable skip_copy --> Mongo DB has no checksum error.
> >
> > We're pretty sure this is a bug, based on our repeated database I/O testing. With skip_copy enabled, raid5/raid6 always causes the Mongo DB checksum error in degraded mode in less than one hour. By contrast, the error never occurs with skip_copy disabled. Also, because skip_copy only affects writes, not reads, I think the bug is caused by writes in degraded mode while skip_copy is enabled.
> >
> > Please kindly provide us some help or ideas about the root cause and solution.
>
> Thanks for the report, I'll look at it. In the meantime, do you have a quick way which I can use to reproduce the issue?

I couldn't find anything suspicious after checking for a while. Can you describe the setup and test in detail? For example, is there a sync running, or are there any IO errors?
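
One way to capture the array state asked about above is a small script along these lines. This is only a sketch, not a tested reproducer: the device name md0 and the helper name md_state are assumptions for illustration, and it only reads the level, degraded, sync_action and skip_copy attributes that md exposes under /sys/block/<dev>/md/.

```python
#!/usr/bin/env python3
"""Print md array state relevant to this report (sketch; md0 is an assumed name)."""
import sys
from pathlib import Path

# Attributes under /sys/block/<dev>/md/ that matter for this report.
ATTRS = ("level", "degraded", "sync_action", "skip_copy")


def md_state(dev="md0", sysfs_root="/sys/block"):
    """Return {attribute: value}; value is None if the attribute cannot be read."""
    md_dir = Path(sysfs_root) / dev / "md"
    state = {}
    for name in ATTRS:
        try:
            state[name] = (md_dir / name).read_text().strip()
        except OSError:
            # Attribute missing (e.g. not a raid5/6 array) or not readable.
            state[name] = None
    return state


if __name__ == "__main__":
    # Usage: ./md_state.py [device], device defaults to the assumed md0.
    dev = sys.argv[1] if len(sys.argv) > 1 else "md0"
    for name, value in md_state(dev).items():
        print(f"{name}: {value}")
```

Running it before and during the database test, together with the contents of /proc/mdstat and any md messages in dmesg, would show whether a sync is running and whether skip_copy is actually set while the checksum errors appear.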