Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

On Thu, Jan 28, 2021 at 07:28:57AM -0800, James Bottomley wrote:
On Thu, 2021-01-28 at 14:01 +0100, Michal Hocko wrote:
On Thu 28-01-21 11:22:59, Mike Rapoport wrote:
[...]
One of the major pushbacks on the first RFC [1] of the concept was
about the direct map fragmentation. I tried really hard to find
data showing the performance difference between different page
sizes in the direct map, and I didn't find anything.

So, presuming that large pages do provide an advantage, the first
implementation of secretmem used PMD_ORDER allocations to amortise
the effect of the direct map fragmentation and then handed out 4k
pages at each fault. In addition, there was an option to reserve a
finite pool at boot time and limit secretmem allocations only to
that pool.
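
To make the amortisation described above concrete, here is a minimal
sketch of a pool that takes one PMD-size chunk at a time and hands out
4k pages per fault. It is a toy model, not the secretmem code from the
patch: the class and method names are hypothetical, and the 4 KiB /
2 MiB sizes assume x86-64.

```python
PAGE_SIZE = 4 * 1024                    # assumed base page size (x86-64)
PMD_SIZE = 2 * 1024 * 1024              # assumed PMD-mapped page size (x86-64)
PAGES_PER_PMD = PMD_SIZE // PAGE_SIZE   # 512 base pages per PMD-size chunk


class SecretPool:
    """Hypothetical model of a PMD-backed pool; not the kernel implementation."""

    def __init__(self):
        self.free_pages = []       # page frame numbers ready to hand out
        self.pmd_allocations = 0   # PMD-size chunks taken so far
        self._next_pfn = 0         # next unused page frame number (toy model)

    def _refill(self):
        # One PMD-size allocation. In this model, this is the only point
        # at which a large direct-map mapping would need to be touched.
        base = self._next_pfn
        self._next_pfn += PAGES_PER_PMD
        self.free_pages.extend(range(base, base + PAGES_PER_PMD))
        self.pmd_allocations += 1

    def fault(self):
        """Hand out one 4k page, refilling from a PMD-size chunk when empty."""
        if not self.free_pages:
            self._refill()
        return self.free_pages.pop()


pool = SecretPool()
pfns = [pool.fault() for _ in range(1000)]
print(pool.pmd_allocations, "PMD-size allocations for", len(pfns), "faults")
```

In this model the cost of touching a large mapping is paid once per 512
faults rather than once per fault.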

At some point David suggested using CMA to improve overall
flexibility [3], so I switched secretmem to use CMA.

Now, with the data we have at hand (my benchmarks and Intel's
report David mentioned) I'm not even sure this whole pooling is
required.

I would still like to understand whether that data is actually
representative. With some underlying reasoning, rather than "I have run
these XYZ benchmarks and the numbers do not look terrible".

My theory, and the reason I made Mike run the benchmarks, is that our
fear of TLB misses has been alleviated by CPU speculation advances over
the years.  You can appreciate this if you consider that both Intel and
AMD have increased the number of levels in the page table to
accommodate larger virtual memory sizes: 5 instead of 3.  That increases
the length of the page walk nearly 2x in a physical system and even
more in a virtual system.  Unless this were massively optimized,
systems would have slowed down significantly.  Using 2M pages only
eliminates one level and 2G pages eliminate 2, so I theorized that
fragmentation wouldn't actually be the significant problem we once
thought it was and asked Mike to benchmark it.
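
The page-walk arithmetic above can be sketched as follows. This is a
simplified textbook model, not a measurement: it counts memory
references for a walk that misses in the TLB and gets no help from
paging-structure caches, and it uses the usual two-dimensional walk
formula for nested paging. The function names are hypothetical and the
level counts are plain parameters.

```python
def native_walk_refs(levels, levels_skipped_by_large_page=0):
    """Memory references for one page walk on bare metal.

    A large page terminates the walk early, so each level it skips
    saves one reference.
    """
    return levels - levels_skipped_by_large_page


def nested_walk_refs(guest_levels, host_levels):
    """Memory references for one page walk under nested paging.

    Every guest table pointer is a guest-physical address that needs a
    host walk (host_levels references) before the guest entry itself is
    read (1 reference). The final guest-physical address needs one more
    host walk. Total: (g + 1) * (h + 1) - 1.
    """
    return (guest_levels + 1) * (host_levels + 1) - 1


for levels in (3, 4, 5):
    print(
        f"{levels} levels: native={native_walk_refs(levels)}, "
        f"native with one level skipped={native_walk_refs(levels, 1)}, "
        f"nested={nested_walk_refs(levels, levels)}"
    )
```

Under these assumptions, skipping one level with a large page saves a
single reference per walk, while adding levels grows the nested walk
roughly quadratically.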

The benchmarks show that, indeed, there isn't a huge change in the data TLB
miss time, I suspect because data is nicely contiguous nowadays and the
prediction that goes into the CPU optimizations is quite easy.  ITLB
fragmentation actually seems to be quite a bit worse, likely because we
still don't have branch prediction down to an exact science.

Another thing is that useful work is normally done by userspace, so data
accesses are dominated by userspace and any change in the dTLB miss rate for
kernel data accesses is only a small fraction of all misses.

James

-- 
Sincerely yours,
Mike.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
