Re: [PATCH bpf-next v4 00/30] bpf: switch to memcg-based memory accounting

From: Roman Gushchin <hidden>
Date: 2020-08-21 22:21:13
Also in: bpf, linux-mm, lkml

On Fri, Aug 21, 2020 at 08:01:04AM -0700, Roman Gushchin wrote:

Currently bpf is using the memlock rlimit for the memory accounting.
This approach has its downsides and over time has created a significant
amount of problems:

1) The limit is per-user, but because most bpf operations are performed
   as root, the limit has a little value.

2) It's hard to come up with a specific maximum value. Especially because
   the counter is shared with non-bpf users (e.g. memlock() users).
   Any specific value is either too low and creates false failures
   or too high and useless.

3) Charging is not connected to the actual memory allocation. Bpf code
   should manually calculate the estimated cost and precharge the counter,
   and then take care of uncharging, including all fail paths.
   It adds to the code complexity and makes it easy to leak a charge.

4) There is no simple way of getting the current value of the counter.
   We've used drgn for it, but it's far from being convenient.

5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
   a function to "explain" this case for users.

In order to overcome these problems let's switch to the memcg-based
memory accounting of bpf objects. With the recent addition of the percpu
memory accounting, now it's possible to provide a comprehensive accounting
of the memory used by bpf programs and maps.

This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's way more flexible and allows
   a better control over memory usage by different workloads. Of course, it
   requires enabled cgroups and kernel memory accounting and properly configured
   cgroup tree, but it's a default configuration for a modern Linux system.

2) The actual memory consumption is taken into account. It happens automatically
   on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also
   performed automatically on releasing the memory. So the code on the bpf side
   becomes simpler and safer.

3) There is a simple way to get the current value and statistics.

In general, if a process performs a bpf operation (e.g. creates or updates
a map), it's memory cgroup is charged. However map updates performed from
an interrupt context are charged to the memory cgroup which contained
the process, which created the map.

Providing a 1:1 replacement for the rlimit-based memory accounting is
a non-goal of this patchset. Users and memory cgroups are completely
orthogonal, so it's not possible even in theory.
Memcg-based memory accounting requires a properly configured cgroup tree
to be actually useful. However, it's the way how the memory is managed
on a modern Linux system.


The patchset consists of the following parts:
1) an auxiliary patch by Johanness, which adds an ability to charge
   a custom memory cgroup from an interrupt context
2) memcg-based accounting for various bpf objects: progs and maps
3) removal of the rlimit-based accounting
4) removal of rlimit adjustments in userspace samples

As a note, I've resent the first patch from the series as a standalone
patch to linux-mm@, because a similar change is required by other non-related
patchset. This should avoid further merge conflicts.

I did some renamings in the patch, so v5 of this patchset is expected.
Please, don't merge v4. Feedback is highly appreciated though.

Thanks!

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help