Thread: 51 messages, 10 authors, 2021-11-30

Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB

From: David Hildenbrand <hidden>
Date: 2021-11-30 15:53:27
Also in: io-uring, linux-api, lkml

(sorry, was busy working on other stuff)

>> That would be giving up on compound pages (hugetlbfs, THP, ...) on any
>> current Linux system that does not use ZONE_MOVABLE -- which is not
>> something I am willing to buy into, just like our customers ;)
>
> So we have ZONE_MOVABLE but users won't use it?

It's mostly used in the memory hot(un)plug context and we'll see growing
usage there in the near future (mostly due to dax/kmem, virtio-mem).

One has to be very careful how to size ZONE_MOVABLE, though, and it's
incompatible with various use cases (even huge pages on some
architectures are not movable and cannot be placed on ZONE_MOVABLE ...).
That's why we barely see it getting used automatically outside of memory
hot(un)plug context or when explicitly set up by the admin for a
finely tuned system.

> Then why is the solution to push the same kinds of restrictions as
> ZONE_MOVABLE on to ZONE_NORMAL?

On any zone except ZONE_DEVICE to be precise. Defragmentation is one of
the main reasons we have pageblocks after all -- besides CMA and page
isolation. If we don't care about defragmentation we could just squash
MIGRATE_MOVABLE, MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE into a single
type. But after all that's the only thing that provides us with THP in
most setups out there.

Note that some people (IIRC Mel) even proposed to remove ZONE_MOVABLE
and instead have "sticky" MIGRATE_MOVABLE pageblocks, meaning
MIGRATE_MOVABLE pageblocks that cannot be converted to a different type
or stolen from -- which would mimic the same thing as the pageblocks we
essentially have in ZONE_MOVABLE.
 
>> See my other mail, the upstream version of my reproducer essentially
>> shows what FOLL_LONGTERM is currently doing wrong with pageblocks. And
>> at least to me that's an interesting insight :)
>
> Hmm. To your reproducer it would be nice if we could cgroup control
> the # of page blocks a cgroup has pinned. Focusing on # pages pinned
> is clearly the wrong metric, I suggested the whole compound earlier,
> but your point about the entire page block being ruined makes sense
> too.

# pages pinned is part of the story, but yes, "pinned something inside a
pageblock" is a better metric.

I would think that this might be complicated to track, though ...
especially once we have multiple cgroups pinning inside a single
pageblock. Hm ...
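
To make the difference between the two metrics concrete, here is a minimal
sketch that counts both for a set of pinned page frame numbers. It is a toy
model, not kernel code: the function names are hypothetical, and the pageblock
size of 512 pages is an assumption (order-9 pageblocks with 4 KiB base pages);
the real value depends on architecture and configuration.

```python
# Toy model: compare "pages pinned" with "pageblocks containing a pin".
# PAGES_PER_PAGEBLOCK = 512 is an assumption (order-9 pageblocks, i.e.
# 2 MiB with 4 KiB base pages); the kernel value depends on arch/config.
PAGES_PER_PAGEBLOCK = 512


def pinned_pageblocks(pinned_pfns):
    """Return the set of pageblock indices holding at least one pinned page."""
    return {pfn // PAGES_PER_PAGEBLOCK for pfn in pinned_pfns}


def pin_metrics(pinned_pfns):
    """Return (pages pinned, pageblocks touched) for a collection of PFNs."""
    pfns = set(pinned_pfns)
    return len(pfns), len(pinned_pageblocks(pfns))


def shared_pageblocks(pins_a, pins_b):
    """Pageblocks in which two (hypothetical) cgroups both hold a pin."""
    return pinned_pageblocks(pins_a) & pinned_pageblocks(pins_b)


# 512 pins packed into one pageblock vs. 512 pins spread one per pageblock.
packed = range(0, 512)
scattered = range(0, 512 * PAGES_PER_PAGEBLOCK, PAGES_PER_PAGEBLOCK)

print(pin_metrics(packed))     # (512, 1)
print(pin_metrics(scattered))  # (512, 512)
```

Both cases pin the same number of pages, yet the second one touches 512
pageblocks instead of one. shared_pageblocks() hints at the tracking problem
mentioned above: once two cgroups pin inside the same pageblock, it is no
longer obvious who should be charged for it.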

> It means pinned pages will have to be migrated to already ruined page
> blocks the cgroup owns, which is a more controlled version of the
> FOLL_LONGTERM migration you have been thinking about.

MIGRATE_UNMOVABLE pageblocks are already ruined. But we'd need some way
to manage/charge pageblocks per cgroup, I guess? That sounds very
interesting.

> This would effectively limit the fragmentation a hostile process group
> can create. If we further treated unmovable cgroup-charged kernel
> allocations as 'pinned' and routed them to the pinned page blocks it
> starts to look really interesting. Kill the cgroup, get all your THPs
> back? Fragmentation cannot extend past the cgroup?

So essentially any accounted unmovable kernel allocation (e.g., page
tables, secretmem, ... ) would try to be placed on a MIGRATE_UNMOVABLE
pageblock "charged" to the respective cgroup?
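
The idea of charging unmovable pageblocks to a cgroup can be sketched with a
toy allocator. This is only an illustration under stated assumptions: the
class ToyPageblockPool, its methods, and the 512-page pageblock size are all
hypothetical and do not correspond to any existing kernel interface.

```python
# Toy model of "charging" unmovable pageblocks to a cgroup. Everything here
# (class name, methods, sizes) is hypothetical; it is not kernel code.
PAGES_PER_PAGEBLOCK = 512  # assumed: order-9 pageblocks with 4 KiB pages


class ToyPageblockPool:
    def __init__(self, nr_pageblocks):
        self.free_blocks = list(range(nr_pageblocks))  # fully free pageblocks
        self.owned = {}  # cgroup name -> list of [pageblock index, pages used]

    def alloc_unmovable(self, cgroup, nr_pages=1):
        """Place unmovable pages only in pageblocks charged to `cgroup`."""
        blocks = self.owned.setdefault(cgroup, [])
        while nr_pages > 0:
            # Charge a fresh pageblock only when the cgroup has none yet or
            # its most recent one is already full.
            if not blocks or blocks[-1][1] == PAGES_PER_PAGEBLOCK:
                if not self.free_blocks:
                    raise MemoryError("no free pageblock left to charge")
                blocks.append([self.free_blocks.pop(0), 0])
            take = min(nr_pages, PAGES_PER_PAGEBLOCK - blocks[-1][1])
            blocks[-1][1] += take
            nr_pages -= take

    def kill(self, cgroup):
        """Tear down the cgroup: all of its pageblocks become free again."""
        for index, _used in self.owned.pop(cgroup, []):
            self.free_blocks.append(index)

    def nr_free_pageblocks(self):
        return len(self.free_blocks)


pool = ToyPageblockPool(nr_pageblocks=8)
pool.alloc_unmovable("batch-a", nr_pages=600)  # spills into a second pageblock
pool.alloc_unmovable("batch-b", nr_pages=10)
print(pool.nr_free_pageblocks())  # 5
pool.kill("batch-a")
print(pool.nr_free_pageblocks())  # 7
```

Because each cgroup's unmovable pages stay inside pageblocks it owns, killing
"batch-a" returns two whole pageblocks at once. The sketch ignores everything
that makes this hard in practice, such as shared allocations and pageblocks
that are only partially freed.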

> i.e., there are lots of batch workloads that could be interesting there -
> wrap the batch in a cgroup, run it, then kill everything and since the
> cgroup gives some lifetime clustering to the allocator you get a lot
> less fragmentation when the batch is finished, so the next batch gets
> more THPs, etc.
>
> There is also sort of an interesting optimization opportunity - many
> FOLL_LONGTERM users would be happy to spend more time pinning to get
> nice contiguous memory ranges. Might help convince people that the
> extra pin time for migrations is worthwhile.

Indeed. And fortunately, huge page users (heavily used in vfio context
and for VMs) wouldn't be affected because they only pin huge pages and
there is nothing to migrate then (well, excluding MIGRATE_CMA and
ZONE_MOVABLE, which we already have, of course).

>>> Something like io_uring is registering a bulk amount of memory and then
>>> doing some potentially long operations against it.
>>
>> The individual operations it performs are comparable to O_DIRECT I think
>
> Yes, and O_DIRECT can take tens of seconds in troubled cases with IO
> timeouts and things.

I might be wrong about O_DIRECT semantics, though. Staring at
fs/io_uring.c I don't really have a clue how they are getting used. I
assume they are getting used for DMA directly.


-- 
Thanks,

David / dhildenb
