Re: [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs)
From: Andrew Morton <akpm@linux-foundation.org>
Date: 2021-10-15 20:28:09
Also in:
linux-fsdevel, lkml
On Thu, 14 Oct 2021 12:16:09 -0700 Yang Shi [off-list ref] wrote:
When discussing the patch that splits page cache THP in order to offline the poisoned page, Naoya mentioned there is a bigger problem [1] that prevents this from working, since the page cache page will be truncated if uncorrectable errors happen. Looking into this more deeply, it turns out this approach (truncating the poisoned page) may incur silent data loss for all non-readonly filesystems if the page is dirty. It may be worse for in-memory filesystems, e.g. shmem/tmpfs, since the data blocks are actually gone. To solve this problem we could keep the poisoned dirty page in the page cache and then notify the users on any later access, e.g. page fault, read/write, etc. Clean pages could be truncated as is, since they can be reread from disk later on. The consequence is that the filesystems may find a poisoned page and manipulate it as a healthy page, since the filesystems don't check whether the page is poisoned in any of the relevant paths except page fault. In general, we need to make the filesystems aware of poisoned pages before we can keep the poisoned page in the page cache in order to solve the data loss problem.
Is the "RFC" still accurate, or might it be an accidental leftover? I grabbed this series as-is for some testing, but I do think it would be better if it were delivered as two separate series: one for the -stable material and one for the 5.16-rc1 material.