Thread (23 messages) 23 messages, 3 authors, 2024-05-31

Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

From: John Johansen <john.johansen@canonical.com>
Date: 2024-02-09 17:33:34
Also in: lkml

On 2/6/24 20:40, Neeraj Upadhyay wrote:
Gentle ping.
  
John,

Could you please confirm that:

a. The AppArmor refcount usage described in the RFC is correct?
b. Approach taken to fix the scalability issue is valid/correct?
Hi Neeraj,

I know your patchset has been waiting on review for a long time.
Unfortunately I have been very, very busy lately. I will try to
get to it this weekend, but I can't promise that I will be able
to get the review fully done.

john

On 1/10/2024 4:41 PM, Neeraj Upadhyay wrote:
quoted
Problem Statement
=================

Nginx performance testing with Apparmor enabled (with nginx
running in unconfined profile), on kernel versions 6.1 and 6.5
show significant drop in throughput scalability, when Nginx
workers are scaled to higher number of CPUs across various
L3 cache domains.

Below is one sample data on the throughput scalability loss,
based on results on AMD Zen4 system with 96 CPUs with SMT
core count 2; so, overall, 192 CPUs:

Config      Cache Domains     apparmor=off        apparmor=on
                              scaling eff (%)      scaling eff (%)
8C16T          1                  100%             100%
16C32T         2                   95%              94%
24C48T         3                   94%              93%
48C96T         6                   92%              88%
96C192T        12                  85%              68%

If we look at above data, there is a significant drop in
scaling efficiency, when we move to 96 CPUs/192 SMT threads.

Perf tool shows most of the contention coming from below
6.56%     nginx  [kernel.vmlinux]      [k] apparmor_current_getsecid_subj
6.22%     nginx  [kernel.vmlinux]      [k] apparmor_file_open

The majority of the CPU cycles is found to be due to memory contention
in atomic_fetch_add and atomic_fetch_sub operations from kref_get() and
kref_put() operations on label.

Commit 2516fde1fa00 ("apparmor: Optimize retrieving current task secid"),
from 6.7 alleviates the issue to an extent, but not completely:

Config      Cache Domains     apparmor=on        apparmor=on (patched)
                              scaling eff (%)      scaling eff (%)
8C16T          1                  100%             100%
16C32T         2                   97%              93%
24C48T         3                   94%              92%
48C96T         6                   88%              88%
96C192T        12                  65%              79%

This adverse impact gets more pronounced when we move to >192 CPUs.
The memory contention and impact increases with high frequency label
update operations and labels are marked stale more frequently.


Label Refcount Management
=========================

Apparmor uses label objects (struct aa_label) to manage refcounts for
below set of objects:

- Applicable profiles
- Namespaces (unconfined profile)
- Other non-profile references

These label references are acquired on various apparmor lsm hooks,
on operations such as file open, task kill operations, socket bind,
and other file, socket, misc operations which use current task's cred,
when the label for the current cred, has been marked stale. This is
done to check these operations against the set of allowed operations
for the task performing them.

Use Percpu refcount for ref management?
=======================================

The ref put operations (percpu_ref_put()) in percpu refcount,
in active mode, do not check whether ref count has dropped to
0. The users of the percpu_ref need to explicitly invoke
a percpu_ref_kill() operation, to drop the initial reference,
at shutdown paths. After the percpu_ref_kill() operation, ref
switches to atomic mode and any new percpu_ref_put() operation
checks for the drop to 0 case and invokes the release operation
on that label.

Labels are marked stale is situations like profile removal,
profile updates. For a namespace, the unconfined label reference
is dropped when the namespace is destroyed. These points
are potential shutdown points for labels. However, killing
the percpu ref from these points has few issues:

- The label could still be referenced by tasks, which are
   still holding the reference to the now stale label.
   Killing the label ref while these operations are in progress
   will make all subsequent ref-put operations on the stale label
   to be atomic, which would still result in memory contention.
   Also, any new reference to the stale label, which is acquired
   with the elevated refcount will have atomic op contention.

- The label is marked stale using a non-atomic write operation.
   It is possible that new operations do not observe this flag
   and still reference it for quite some time.

- Explicitly tracking the shutdown points might not be maintainable
   at a per label granularity, as there can be various paths where
   label reference could get dropped, such as, before the label has
   gone live - object initialization error paths. Also, tracking
   the shutdown points for labels which reference other labels -
   subprofiles, merged labels requires careful analysis, and adds
   heavy burden on ensuring the memory contention is not introduced
   by these ref kill points.


Proposed Solution
=================

One potential solution to the refcount scalability problem is to
convert the label refcount to a percpu refcount, and manage
the initial reference from kworker context. The kworker
keeps an extra reference to the label and periodically scans
labels and release them if their refcount drops to 0.

Below is the sequence of operations, which shows the refcount
management with this approach:

1. During label initialization, the percpu ref is initialized in
    atomic mode. This is done to ensure that, for cases where the
    label hasn't gone live (->ns isn't assigned), mostly during
    initialization error paths.

2. Labels are switched to percpu mode at various points -
    when a label is added to labelset tree, when a unconfined profile
    has been assigned a namespace.

3. As part of the initial prototype, only the in tree labels
    are managed by the kworker. These labels are added to a lockless
    list. The unconfined labels invoke a percpu_ref_kill() operation
    when the namespace is destroyed.

4. The kworker does a periodic scan of all the labels in the
    llist. It does below sequence of operations:

    a. Enqueue a dummy node to mark the start of scan. This dummy
       node is used as start point of scan and ensures that we
       there is no additional synchronization required with new
       label node additions to the llist. Any new labels will
       be processed in next run of the kworker.

                       SCAN START PTR
                           |
                           v
       +----------+     +------+    +------+    +------+
       |          |     |      |    |      |    |      |
       |   head   ------> dummy|--->|label |--->| label|--->NULL
       |          |     | node |    |      |    |      |
       +----------+     +------+    +------+    +------+


       New label addition:

                             SCAN START PTR
                                  |
                                  v
       +----------+  +------+  +------+    +------+    +------+
       |          |  |      |  |      |    |      |    |      |
       |   head   |--> label|--> dummy|--->|label |--->| label|--->NULL
       |          |  |      |  | node |    |      |    |      |
       +----------+  +------+  +------+    +------+    +------+

     b. Traverse through the llist, starting from dummy->next.
        If the node is a dummy node, mark it free.
        If the node is a label node, do,

        i) Switch the label ref to atomic mode. The ref switch wait
           for the existing percpu_ref_get() and percpu_ref_put()
           operations to complete, by waiting for a RCU grace period.

           Once the switch is complete, from this point onwards, any
           percpu_ref_get(), percpu_ref_put() operations use
           atomic operations.

       ii) Drop the initial reference, which was taken while adding
           the label node to the llist.

      iii) Use a percpu_ref_tryget() increment operation on the
           ref, to see if we dropped the last ref count. if we
           dropped the last count, we remove the node from the llist.

           All of these operations are done inside a RCU critical
           section, to avoid race with the release operations,
           which can potentially trigger, as soon as we drop
           the initial ref count.

       iv) If we didn't drop the last ref, switch back the counter
           to percpu mode.

Using this approach, to move the atomic refcount manipulation out of the
contended paths, there is a significant scalability improvement seen on
nginx test, and scalability efficiency is close to apparmor-off case.

Config      Cache Domains     apparmor=on (percpuref)
                                scaling eff (%)
8C16T          1                  100%
16C32T         2                   96%
24C48T         3                   94%
48C96T         6                   93%
96C192T        12                  90%

Limitations
===========

1. Switching to percpu refcount increases memory size overhead, as
    percpu memory is allocated for all labels.

2. Deferring labels reclaim could potentially result in memory
    pressure, when there are high frequency of label update operations.

3. Percpu refcount uses call_rcu_hurry() to complete switch operations.
    These can impact energy efficiency, due to back to back hurry
    callbacks. Using deferrable workqueue partly mitigates this.
    However, deferring kworker can delay reclaims.

4. Back to back label switches can delay other percpu users, as
    there is a single global switch spinlock used by percpu refcount
    lib.

5. Long running kworker can delay other use cases like system suspend.
    This is mitigated using freezable workqueue and litming node
    scans to a max count.

6. There is a window where label operates is atomic mode, when its
    counter is being checked for zero. This can potentially result
    in high memory contention, during this window which spans RCU
    grace period (plus callback execution). For example, when
    scanning label corresponding to unconfined profile, all
    applications which use unconfined profile would be using
    atomic ref increment and decrement operations.

    There are a few apparoaches which were tried to mitigate this issue:

    a. At a lower time interval, check if scanned label's counter
       has changed since the start of label scan. If there is a change
       in count, terminate the switch to atomic mode. Below shows the
       apparoch using rcuwait.

       static void aa_label_switch_atomic_confirm(struct percpu_ref *label_ref)
       {
          WRITE_ONCE(aa_atomic_switch_complete, true);
          rcuwait_wake_up(&aa_reclaim_rcuwait);
       }

       rcuwait_init(&aa_reclaim_rcuwait);
       percpu_ref_switch_to_atomic(&label->count, aa_label_switch_atomic_confirm);

       atomic_count = percpu_ref_count_read(&label->count);
       do {
         rcuwait_wait_event_timeout(&aa_reclaim_rcuwait,
                            (switch_complete = READ_ONCE(aa_atomic_switch_complete)),
                            TASK_IDLE,
                            msecs_to_jiffies(5));
         if (percpu_ref_count_read(&label->count) != atomic_count)
                 break;
        } while (!READ_ONCE(switch_complete));

        However, this approach does not work, as percpu refcount lib does not
        allow termination of an ongoing switch operation. Also, the counter
        can return to the original value with set of get() and put() operations
        before we check the current value.

    b. Approaches to notify the reclaim kworker from ref get and put operations
       can potentially disturb cache line state between the various CPU
       contexts, which are referncing the label, and can potentially impact
       scalability again.

    c. Swith the label to an immortal percpu ref, while the scan operates
       on the current counter.

       Below is the sequence of operations to do this:

       1. Ensure that both immortal ref and label ref are in percpu mode.
          Reinit the immortal ref in percpu mode.

          Swap percpu and atomic counters of label refcount and immortal ref
	                          percpu-ref
       	                  +-------------------+
       +-------+           |  percpu-ctr-addr1 |
       | label | --------->|-------------------|    +----------------+
       +-------+           |   data            |--->| Atomic counter1|
                           +-------------------+    +----------------+
       +-------+           +-------------------+
       |ImmLbl |---------->|  percpu-ctr-addr2 |    +----------------+
       +-------+           |-------------------|--->| Atomic counter2|
                           |    data           |    +----------------+
                           +-------------------+

           label ->percpu-ctr-addr  = percpu-ctr-addr2
           ImmLbl ->percpu-ctr-addr = percpu-ctr-addr1
           label ->data->count      = Atomic counter2
           ImmLbl ->data->count     = Atomic counter1
   
   
       2. Check the counters collected in immortal label, by switch it
          to atomic mode.

       3. If the count is 0, do,
          a. Switch immortal counter to percpu again, giving it an
             initial count of 1.
          b. Swap the label and immortal counters again. The immortal
             ref now has the counter values from new percpu ref get
             and get operations on the label ref, from the point
             when we did the initial swap operation.
          c. Transfer the percpu counts in immortal ref to atomic
             counter of label percpu refcount.
          d. Kill immortal ref, for reinit on next iteration.
          e. Switch label percpu ref to atomic mode.
          f. If the counter is 1, drop the initial ref.

        4. If the count is not 0, re-swap the counters.
           a. Switch immortal counter to percpu again, giving it an
              initial count of 1.
           b. Swap the label and immortal counters again. The immortal
              ref now has the counter values from new percpu ref get
              and get operations on the label ref, from the point
              when we did the initial swap operation.
           c. Transfer the percpu counts in immortal ref to atomic
              counter of label percpu refcount.
           d. Kill immortal ref, for reinit on next iteration.


           Using this approach, we ensure that, label ref users do not switch
           to atomic mode, while there are active references on the label.
           However, this approach requires multiple percpu ref mode switches
           and adds high overhead and complexity to the scanning code.

Extended/Future Work
====================

1. Look for ways to fix the limitations, as described in the "Limitations"
    section.

2. Generalize the approach to percpu rcuref, which is used for contexts
    where release path uses RCU grace period for release operations. Patch
    7 creates an initial prototype for this.

3. Explore hazard pointers for scalable refcounting of labels.

Highly appreciate any feedback/suggestions on the design approach.

The patches of this patchset introduce following changes:

1.      Documentation of Apparmor Refcount management.

2.      Switch labels to percpu refcount in atomic mode.

         Use percpu refcount for apparmor labels. Initial patch to init
         the percpu ref in atomic mode, to evaluate the potential
         impact of percpuref on top of kref based implementation.

3.      Switch unconfined namespaces refcount to percpu mode.

         Switch unconfined ns labels to percpu mode, and kill the
         initial refcount from namespace destroy path.

4.      Add infrastructure to reclaim percpu labels.

         Add a label reclaim infrastructure for labels which are
         in percpu mode, for managing their inital refcount.

5.      Switch intree labels to percpu mode.

         Use label reclaim infrastruture to manage intree labels.

6.      Initial prototype for optimizing ref switch.

         Prototype for reducing the time window when a label
         scan switches the label ref to atomic mode.

7.      percpu-rcuref: Add basic infrastructure.

         Prototype for Percpu refcounts for objects, which protect
         their object reclaims using RCU grace period.

8.      Switch labels to percpu rcurefcount in unmanaged mode.

         Use percpu rcuref for labels. Start with unmanaged/atomic
         mode.

9.      Switch unconfined and in tree labels to managed ref mode.

         Use percpu mode with manager worker for unconfined and intree
         labels.


------------------------------------------------------------------------

  b/Documentation/admin-guide/LSM/ApparmorRefcount.rst |  351 ++++++++++++++++++++++++++++++++++++++++++++++++++
  b/Documentation/admin-guide/LSM/index.rst            |    1
  b/Documentation/admin-guide/kernel-parameters.txt    |    8 +
  b/include/linux/percpu-rcurefcount.h                 |  115 ++++++++++++++++
  b/include/linux/percpu-refcount.h                    |    2
  b/lib/Makefile                                       |    2
  b/lib/percpu-rcurefcount.c                           |  336 +++++++++++++++++++++++++++++++++++++++++++++++
  b/lib/percpu-refcount.c                              |   93 +++++++++++++
  b/security/apparmor/include/label.h                  |   16 +-
  b/security/apparmor/include/policy.h                 |    8 -
  b/security/apparmor/include/policy_ns.h              |   24 +++
  b/security/apparmor/label.c                          |   11 +
  b/security/apparmor/lsm.c                            |  145 ++++++++++++++++++++
  b/security/apparmor/policy_ns.c                      |    6
  include/linux/percpu-refcount.h                      |    2
  lib/percpu-refcount.c                                |   93 -------------
  security/apparmor/include/label.h                    |   17 +-
  security/apparmor/include/policy.h                   |   56 +++----
  security/apparmor/include/policy_ns.h                |   24 ---
  security/apparmor/label.c                            |   11 -
  security/apparmor/lsm.c                              |  325 ++++++++++++----------------------------------
  security/apparmor/policy_ns.c                        |    8 -
  22 files changed, 1237 insertions(+), 417 deletions(-)

base-commit: ab27740f7665
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help