Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space for 'size'... | netdev

Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'

From: Alexander H Duyck <hidden>
Date: 2024-07-10 15:28:59
Also in: linux-mm, lkml

On Wed, 2024-07-03 at 20:33 +0800, Yunsheng Lin wrote:

On 2024/7/2 22:55, Alexander H Duyck wrote:

...

quoted

+#define PAGE_FRAG_CACHE_ORDER_MASK		GENMASK(7, 0)
+#define PAGE_FRAG_CACHE_PFMEMALLOC_BIT		BIT(8)
+#define PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT	8
+
+static inline struct encoded_va *encode_aligned_va(void *va,
+						   unsigned int order,
+						   bool pfmemalloc)
+{
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	__u16 offset;
-	__u16 size;
+	return (struct encoded_va *)((unsigned long)va | order |
+			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
 #else
-	__u32 offset;
+	return (struct encoded_va *)((unsigned long)va |
+			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
+#endif
+}
+

This is missing any and all protection of the example you cited. If I
want to pass order as a 32b value I can and I can corrupt the virtual
address. Same thing with pfmemalloc. Lets only hope you don't hit an
architecture where a bool is a u8 in size as otherwise that shift is
going to wipe out the value, and if it is a u32 which is usually the
case lets hope they stick to the values of 0 and 1.

I explicitly checked that the protection is not really needed due to
performance consideration:
1. For the 'pfmemalloc' part, it does always stick to the values of 0
   and 1 as below:
https://elixir.bootlin.com/linux/v6.10-rc6/source/Documentation/process/coding-style.rst#L1053

2. For the 'order' part, its range can only be within 0~3.

quoted

+static inline unsigned long encoded_page_order(struct encoded_va *encoded_va)
+{
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+	return PAGE_FRAG_CACHE_ORDER_MASK & (unsigned long)encoded_va;
+#else
+	return 0;
+#endif
+}
+
+static inline bool encoded_page_pfmemalloc(struct encoded_va *encoded_va)
+{
+	return PAGE_FRAG_CACHE_PFMEMALLOC_BIT & (unsigned long)encoded_va;
+}
+

My advice is that if you just make encoded_va an unsigned long this
just becomes some FIELD_GET and bit operations.

As above.

The code you mentioned had one extra block of bits that was in it and
had strict protections on what went into and out of those bits. You
don't have any of those protections.

As above, the protection masking/checking is explicitly avoided due
to performance consideration and reasons as above for encoded_va.

But I guess it doesn't hurt to have a VM_BUG_ON() checking to catch
possible future mistake.

quoted

I suggest you just use a long and don't bother storing this as a
pointer.

...

quoted

-
+	remaining = nc->remaining & align_mask;
+	remaining -= fragsz;
+	if (unlikely(remaining < 0)) {

Now this is just getting confusing. You essentially just added an
additional addition step and went back to the countdown approach I was
using before except for the fact that you are starting at 0 whereas I
was actually moving down through the page.

Does the 'additional addition step' mean the additional step to calculate
the offset using the new 'remaining' field? I guess that is the disadvantage
by changing 'offset' to 'remaining', but it also some advantages too:

1. it is better to replace 'offset' with 'remaining', which
   is the remaining size for the cache in a 'page_frag_cache'
   instance, we are able to do a single 'fragsz > remaining'
   checking for the case of cache not being enough, which should be
   the fast path if we ensure size is zoro when 'va' == NULL by
   memset'ing 'struct page_frag_cache' in page_frag_cache_init()
   and page_frag_cache_drain().
2. It seems more convenient to implement the commit/probe API too
   when using 'remaining' instead of 'offset' as those API needs
   the remaining size of the page_frag_cache anyway.

So it is really a trade-off between using 'offset' and 'remaining',
it is like the similar argument about trade-off between allocating
fragment 'countdown' and 'countup' way.

About confusing part, as the nc->remaining does mean how much cache
is left in the 'nc', and nc->remaining does start from
PAGE_FRAG_CACHE_MAX_SIZE/PAGE_SIZE to zero naturally if that was what
you meant by 'countdown', but it is different from the 'offset countdown'
before this patchset as my understanding.

quoted

What I would suggest doing since "remaining" is a negative offset
anyway would be to look at just storing it as a signed negative number.
At least with that you can keep to your original approach and would
only have to change your check to be for "remaining + fragsz <= 0".

Did you mean by using s16/s32 for 'remaining'? And set nc->remaining like
below?
nc->remaining = -PAGE_SIZE or
nc->remaining = -PAGE_FRAG_CACHE_MAX_SIZE

Yes. Basically if we used it as a signed value then you could just work
your way up and do your aligned math still.

For the aligned math, I am not sure how can 'align_mask' be appiled to
a signed value yet. It seems that after masking nc->remaining leans
towards -PAGE_SIZE/-PAGE_FRAG_CACHE_MAX_SIZE instead of zero for
a unsigned value, for example:

nc->remaining = -4094;
nc->remaining &= -64;

It seems we got -4096 for above, is that what we are expecting?

No, you have to do an addition and then the mask like you were before
using __ALIGN_KERNEL.

As it stands I realized your alignment approach in this patch is
broken. You should be aligning the remaining at the start of this
function and then storing it before you call
page_frag_cache_current_va. Instead you do it after so the start of
your block may not be aligned to the requested mask if you have
multiple callers sharing this function or if you are passing an
unaligned size in the request.

quoted

struct page_frag_cache {
        struct encoded_va *encoded_va;

#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
        u16 pagecnt_bias;
        s16 remaining;
#else
        u32 pagecnt_bias;
        s32 remaining;
#endif
};

If I understand above correctly, it seems we really need a better name
than 'remaining' to reflect that.

It would effectively work like a countdown. Instead of T - X in this
case it is size - X.

quoted

With that you can still do your math but it becomes an addition instead
of a subtraction.

And I am not really sure what is the gain here by using an addition
instead of a subtraction here.

The "remaining" as it currently stands is doing the same thing so odds
are it isn't too big a deal. It is just that the original code was
already somewhat confusing and this is just making it that much more
complex.

quoted

+		page = virt_to_page(encoded_va);
 		if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
 			goto refill;
 
-		if (unlikely(nc->pfmemalloc)) {
-			free_unref_page(page, compound_order(page));
+		if (unlikely(encoded_page_pfmemalloc(encoded_va))) {
+			VM_BUG_ON(compound_order(page) !=
+				  encoded_page_order(encoded_va));
+			free_unref_page(page, encoded_page_order(encoded_va));
 			goto refill;
 		}
 
 		/* OK, page count is 0, we can safely set it */
 		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
 
-		/* reset page count bias and offset to start of new frag */
+		/* reset page count bias and remaining of new frag */
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		offset = 0;
-		if (unlikely(fragsz > PAGE_SIZE)) {
+		nc->remaining = remaining = page_frag_cache_page_size(encoded_va);
+		remaining -= fragsz;
+		if (unlikely(remaining < 0)) {
 			/*
 			 * The caller is trying to allocate a fragment
 			 * with fragsz > PAGE_SIZE but the cache isn't big

I find it really amusing that you went to all the trouble of flipping
the logic just to flip it back to being a countdown setup. If you were
going to bother with all that then why not just make the remaining
negative instead? You could save yourself a ton of trouble that way and
all you would need to do is flip a few signs.

I am not sure I understand the 'a ton of trouble' part here, if 'flipping
a few signs' does save a ton of trouble here, I would like to avoid 'a
ton of trouble' here, but I am not really understand the gain here yet as
mentioned above.

It isn't about flipping the signs. It is more the fact that the logic
has now become even more complex then it originally was. With my work
backwards approach the advantage was that the offset could be updated
and then we just recorded everything and reported it up. Now we have to

I am supposing the above is referring to 'offset countdown' way
before this patchset, right?

quoted

keep a temporary "remaining" value, generate our virtual address and
store that to a temp variable, we can record the new "remaining" value,
and then we can report the virtual address. If we get the ordering on

Yes, I noticed it when coding too, that is why current virtual address is
generated in page_frag_cache_current_va() basing on nc->remaining instead
of the local variable 'remaining' before assigning the local variable
'remaining' to nc->remaining. But I am not sure I understand how using a
signed negative number for 'remaining' will change the above steps. If
not, I still fail to see the gain of using a signed negative number for
'nc->remaining'.

quoted

the steps 2 and 3 in this it will cause issues as we will be pointing
to the wrong values. It is something I can see someone easily messing
up.

Yes, agreed. It would be good to be more specific about how to avoid
the above problem using a signed negative number for 'remaining' as
I am not sure how it can be done yet.

My advice would be to go back to patch 3 and figure out how to do this
re-ordering without changing the alignment behaviour. The old code
essentially aligned both the offset and fragsz by combining the two and
then doing the alignment. Since you are doing a count up setup you will
need to align the remaining, then add fragsz, and then I guess you
could store remaining and then subtract fragsz from your final virtual
address to get back to where the starting offset is actually located.

Basically your "remaining" value isn't a safe number to use for an
offset since it isn't aligned to your starting value at any point.

As far as the negative value, it is more about making it easier to keep
track of what is actually going on. Basically we can use regular
pointer math and as such I suspect the compiler is having to do extra
instructions to flip your value negative before it can combine the
values via something like the LEA (load effective address) assembler
call.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help