Re: releasing result pages in svc_xprt_release()

From: NeilBrown <hidden>
Date: 2021-02-01 23:28:43

On Mon, Feb 01 2021, Chuck Lever wrote:

quoted

On Jan 31, 2021, at 6:45 PM, NeilBrown [off-list ref] wrote:

On Fri, Jan 29 2021, Chuck Lever wrote:

quoted

What's your opinion?

To form a coherent opinion, I would need to know what that problem is.
I certainly accept that there could be performance problems in releasing
and re-allocating pages which might be resolved by batching, or by copying,
or by better tracking.  But without knowing what hot-spot you want to
cool down, I cannot think about how that fits into the big picture.
So: what exactly is the problem that you see?

The problem is that each 1MB NFS READ, for example, hands 257 pages
back to the page allocator, then allocates another 257 pages. One page
is not terrible, but 510+ allocator calls for every RPC begins to get
costly.

Also, remember that both allocating and freeing a page means an irqsave
spin lock -- that will serialize all other page allocations, including
allocation done by other nfsd threads.

So I'd like to lower page allocator contention and the rate at which
IRQs are disabled and enabled when the NFS server becomes busy, as it
might with several 25 GbE NICs, for instance.

Holding onto the same pages means we can make better use of TLB
entries -- fewer TLB flushes is always a good thing.

I know that the network folks at Red Hat have been staring hard at
reducing memory allocation in the stack for several years. I recall
that Matthew Wilcox recently made similar improvements to the block
layer.

With the advent of 100GbE and Optane-like durable storage, the balance
of memory allocation cost to device latency has shifted so that
superfluous memory allocation is noticeable.


At first I thought of creating a page allocator API that could grab
or free an array of pages while taking the allocator locks once. But
now I wonder if there are opportunities to reduce the amount of page
allocator traffic.

Thanks.  This helps me a lot.

I wonder if there is some low-hanging fruit here.

If I read the code correctly (which is not certain, but what I see does
seem to agree with vague memories of how it all works), we currently do
a lot of wasted alloc/frees for zero-copy reads.

We allocate lots of pages and store the pointers in ->rq_respages
(i.e. ->rq_pages)
Then nfsd_splice_actor frees many of those pages and
replaces the pointers with pointers to page-cache pages.  Then we release
those page-cache pages.

We need to have allocated them, but we don't need to free them.
We can add some new array for storing them, have nfsd_splice_actor move
them to that array, and have svc_alloc_arg() move pages back from the
store rather than re-allocating them.

Or maybe something even more sophisticated where we only move them out
of the store when we actually need them.

Having the RDMA layer return pages when they are finished with might
help.  You might even be able to use atomics (cmpxchg) to handle the
contention.  But I'm not convinced it would be worth it.

I *really* like your idea of a batch-API for page-alloc and page-free.
This would likely be useful for other users, and it would be worth
writing more code to get peak performance - things such as per-cpu
queues of returned pages and so-forth (which presumably already exist).

I cannot be sure that the batch-API would be better than a focused API
just for RDMA -> NFSD.  But my guess is that it would be at least nearly
as good, and would likely get a lot more eyes on the code.

NeilBrown

Attachments

signature.asc [application/pgp-signature] 853 bytes

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help