Re: [v3 0/9] parallelized "struct page" zeroing
From: Michal Hocko <mhocko@kernel.org>
Date: 2017-06-01 08:46:15
Also in:
linux-mm, linux-s390, lkml, sparclinux
On Wed 31-05-17 23:35:48, Pasha Tatashin wrote:
> > OK, so why can't we make zero_struct_page 8x 8B stores, while other
> > arches would do memset? You said it would be slower, but would that be
> > measurable? I am sorry to be so persistent here, but I would be really
> > happier if this didn't depend on the deferred initialization. If this is
> > absolutely a no-go, then I can live with that, of course.
>
> Hi Michal,
>
> This is actually a very good idea. I just did some measurements, and it
> looks like performance is very good. Here is single-threaded data from a
> SPARC-M7 with 3312G of memory:
>
> Current:
>   memset() in memblock allocator takes: 8.83s
>   __init_single_page() takes:           8.63s
>
> Option 1:
>   memset() in __init_single_page() takes: 61.09s
>   (as we discussed, because of membar overhead; memset should really be
>   optimized to do STBI only when the size is 1 page or bigger)
>
> Option 2:
>   8 stores (stx) in __init_single_page(): 8.525s!
>
> So, even for single-threaded performance, we can double the initialization
> speed of "struct page" on SPARC by removing memset() from memblock and
> using 8 stx in __init_single_page(). It appears we never miss L1 in
> __init_single_page() after the initial 8 stx.
OK, that is good to hear, and it actually matches my understanding that
writes confined to a single cacheline should add little overhead. Thanks!

-- 
Michal Hocko
SUSE Labs