Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
From: Vlastimil Babka <hidden>
Date: 2022-11-16 09:09:05
Also in: kvm, linux-crypto, linux-mm, lkml
On 11/15/22 19:15, Kalra, Ashish wrote:
On 11/15/2022 11:24 AM, Kalra, Ashish wrote:
Hello Vlastimil,

On 11/15/2022 9:14 AM, Vlastimil Babka wrote:
Cc'ing memory failure folks, the beginning of this subthread is here:
https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/

On 11/15/22 00:36, Kalra, Ashish wrote:
Hello Boris,

On 11/2/2022 6:22 AM, Borislav Petkov wrote:
On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
	if (snp_lookup_rmpentry(pfn, &rmp_level)) {
		do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
		return RMP_PF_RETRY;

Does this issue some halfway understandable error message why the process got killed?
Will look at adding our own recovery function for the same, but that will again mark the pages as poisoned, right?

Well, not poisoned but PG_offlimits or whatever the mm folks agree upon. Semantically, it'll be handled the same way, ofc.

Added a new PG_offlimits flag and a simple corresponding handler for it.

One thing is, there aren't enough page flags to be adding more (except aliases for existing ones) for cases that can avoid it, but as Boris says, if using an alias of PG_hwpoison, it depends on what would then get confused with actual hwpoison.
But there is still added complexity of handling hugepages as part of reclamation failures (both HugeTLB and transparent hugepages) and that means calling more static functions in mm/memory-failure.c.

There is probably a more appropriate handler in mm/memory-failure.c: soft_offline_page() - this will mark the page as HWPoisoned and also has handling for hugepages. And we can avoid adding a new page flag too.

soft_offline_page - Soft offline a page. Soft offline a page, by migration or invalidation, without killing anything.

So, this looks like a good option to call soft_offline_page() instead of memory_failure() in case of failure to transition the page back to HV/shared state via SNP_RECLAIM_CMD and/or RMPUPDATE instruction.

So it's a bit unclear to me what exact situation we are handling here. The original patch here seems to me to be just leaking back pages that are unsafe for further use. soft_offline_page() seems to fit that scenario of a graceful leak before something is irreparably corrupt and we page fault on it. But then in the thread you discuss PF handling and killing. So what is the case here? If we detect this need to call snp_leak_pages() does it mean:

a) nobody that could page fault at them (the guest?) is running anymore, we are tearing it down, we just can't reuse the pages further on the host

The host can page fault on them, if anything on the host tries to write to these pages. Host reads will return garbage data.
   - seems like soft_offline_page() could work, but maybe we could just put the pages on some leaked lists without special page? The only thing that should matter is not to free the pages to the page allocator, where they would be reused by something else.

b) something can still page fault at them (what?) - AFAIU can't be resolved without killing something, memory_failure() might limit the damage

As I mentioned above, host writes will cause RMP violation page fault.

And to add here, if it's a guest private page, then the above fault cannot be resolved, so the faulting process is terminated.
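To make the "leaked list" idea from option a) concrete, here is a minimal userspace sketch, not kernel code. It assumes the only requirement is the one stated above: a page whose reclaim failed must never go back to the allocator. All names (page_model, page_pool, pool_release, pool_alloc) are hypothetical and exist only for illustration; the reclaim_ok argument stands in for the outcome of the transition back to shared state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct page_model {
	unsigned long pfn;
	bool leaked;              /* set once reclaim has failed for this page */
	struct page_model *next;  /* link on either the free or the leaked list */
};

struct page_pool {
	struct page_model *free_list;   /* pages the allocator may hand out */
	struct page_model *leaked_list; /* pages that must never be reused */
	size_t nr_free;
	size_t nr_leaked;
};

static void pool_push(struct page_model **head, struct page_model *p)
{
	p->next = *head;
	*head = p;
}

/*
 * Release a page. If reclaim_ok is false (the model's stand-in for a failed
 * transition back to shared state), park the page on the leaked list instead
 * of returning it to the free list.
 */
void pool_release(struct page_pool *pool, struct page_model *p, bool reclaim_ok)
{
	if (reclaim_ok) {
		p->leaked = false;
		pool_push(&pool->free_list, p);
		pool->nr_free++;
	} else {
		p->leaked = true;
		pool_push(&pool->leaked_list, p);
		pool->nr_leaked++;
	}
}

/* Allocate from the free list only; leaked pages are never visible here. */
struct page_model *pool_alloc(struct page_pool *pool)
{
	struct page_model *p = pool->free_list;

	if (!p)
		return NULL;
	pool->free_list = p->next;
	p->next = NULL;
	pool->nr_free--;
	assert(!p->leaked); /* invariant: leaked pages never reach the free list */
	return p;
}
```

In this model no page flag is needed to keep a leaked page out of circulation; being on leaked_list is enough. It does not address case b), where something can still fault on the page.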
BTW would this not be mostly resolved as part of rebasing to UPM?
- host will not have these pages mapped in the first place (both kernel directmap and qemu userspace)
- guest will have them mapped, but I assume that the conversion from private to shared (that might fail?) can only happen after guest's mappings are invalidated in the first place?
Thanks,
Ashish
Still waiting for some/more feedback from mm folks on the same.

Just send the patch and they'll give it. Thx.