Re: [PATCH] xfs: set aside allocation btree blocks from block reservation

From: Brian Foster <hidden>
Date: 2021-02-19 14:11:17

On Fri, Feb 19, 2021 at 01:24:11PM +1100, Dave Chinner wrote:

On Thu, Feb 18, 2021 at 08:25:20AM -0500, Brian Foster wrote:

quoted

On Thu, Feb 18, 2021 at 11:34:51AM +1100, Dave Chinner wrote:

quoted

On Wed, Feb 17, 2021 at 08:23:39AM -0500, Brian Foster wrote:

...

quoted

Note that the counter uses a small percpu batch size to allow the
allocation paths to keep the primary count accurate enough that the
reservation path doesn't ever need to lock and sum the counter.
Absolute accuracy is not required here, just that the counter
reflects the majority of unavailable blocks so the reservation path
fails first.

And this makes the per-cpu counter scale almost no better than an
simple atomic counter, because a spinlock requires two atomic
operations (lock and unlock). Hence a batch count of 4 only reduces
the atomic op count by half but introduces at lot of extra
complexity. It won't make a difference to the scalability of
workloads that hammer the btree block count because contention on
the internal counter spinlock will occur at close to the same
concurrency rate as would occur on an atomic counter.

Right, but percpu_counter_read_positive() allows a fast read in the
xfs_mod_fdblocks() path. I didn't use an atomic because I was concerned
about introducing overhead in that path. If we're Ok with whatever
overhead an atomic read might introduce (a spin lock in the worst case
for some arches), then I don't mind switching over to that. I also don't

The generic definition of atomic_read() is this:

/**
 * atomic_read - read atomic variable
 * @v: pointer of type atomic_t
 *
 * Atomically reads the value of @v.
 */
#ifndef atomic_read
#define atomic_read(v)  READ_ONCE((v)->counter)
#endif

And the only arch specific implementations (x86 and arm64) both have
the same implementation.

That's the 32-bit variant, FWIW. It looks like the x86-64 and arm64
atomic64 variants are essentially the same, but the generic 64-bit
implementation is:

s64 atomic64_read(const atomic64_t *v)
{
        unsigned long flags;
        raw_spinlock_t *lock = lock_addr(v);
        s64 val;

        raw_spin_lock_irqsave(lock, flags);
        val = v->counter;
        raw_spin_unlock_irqrestore(lock, flags);
        return val;
}

Arm, powerpc, and s390 appear to have custom implementations which I
assume are more efficient than this. x86 has an arch_atomic64_read()
that falls through a maze of directives with at least a couple
underlying implementations. One appears to be atomic64_read_cx8() which
uses the cache line lock prefix thing. It's not clear to me where the
other goes, or if this ever falls back to the generic implementation..

And percpu_counter_read_positive() is:

/*
 * It is possible for the percpu_counter_read() to return a small negative
 * number for some counter which should never be negative.
 *
 */
static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
        /* Prevent reloads of fbc->count */
        s64 ret = READ_ONCE(fbc->count);

        if (ret >= 0)
                return ret;
        return 0;
}

IOWs, atomic_read() has lower read side overhead than the percpu
counter. atomic reads do not require spinlocks or even locked memory
accesses - they are only needed on the write side. Hence the only
reason for using a pcp counter over an atomic is that the atomic is
update rate bound....

FWIW, generic atomics don't use spin locks anymore - they use
cmpxchg() which is generally much more efficient than a spin lock,
even when emulated on platforms that don't have native support for
cmpxchg(). And, really, we don't care one bit about performance or
scalability on those niche platforms; all the high CPU count
platforms where scalability matters have native atomic cmpxchg
operations.

quoted

mind defining a new spin lock and explicitly implementing the lockless
read in xfs_mod_fdblocks(), I just thought it was extra code for little
benefit over the percpu counter. Preference?

Spinlocks really hurt scalability in fast paths. atomics are much,
much less hurty, and so the threshold for needing percpu counters is
*much higher* than the threshold for needing lockless algorithms to
avoid catastrophic lock contention.[*]

As shown above, this is why I wanted to avoid introducing a spinlock on
the read side. ;) IOW, the functional behavior I was aiming for was:

update code:
	spin_lock();
	counter += delta;
	spin_unlock();

read code:
	READ_ONCE(counter);

I was initially going to pass a batch size of 0 because performance of
the update side is not important. The use of percpu was not for
scalability reasons at all. It was a somewhat lazy reuse of code to
avoid defining a new spinlock just for this.

In any event, the atomic64 read side is essentially equivalent to this
on x86-64 and arm64. If we're comfortable with the remaining custom
atomic64 implementations for other common arches (ppc64, s390x, x86,
etc.) and simply don't care enough about the additional overhead on the
arches that might fall back to the generic implementation, then that is
good enough reason to me to switch to an atomic...

Brian

Cheers,

Dave.

[*] For example, I just added an atomic counter into
xlog_cil_insert_items() that is incremented on every transaction
commit (provides global ordering of items in the CIL checkpoint
placed on per-cpu lists). With the spinlocks from the CIL commit
path, we increase performance from 700k commits/s to 1.4 million
commits/s on a 32p machine and only increase CPU time in
xlog_cil_insert_items() from 1.8% to 3.3%.

IOWs, the atomic counter has almost zero overhead compared to a set
of critical sections protected by multiple spinlocks that were
consuming crazy amounts of CPU time.

Before, at 700,000 commits/sec:

 -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
    - 46.35% xfs_log_commit_cil
       - 41.54% _raw_spin_lock
          - 67.30% do_raw_spin_lock
               66.96% __pv_queued_spin_lock_slowpath


After, at 1.4 million commits/sec:

-   21.78%     3.32%  [kernel]            [k] xlog_cil_commit
   - 17.32% xlog_cil_commit
      - 9.04% xfs_log_ticket_ungrant
           2.22% xfs_log_space_wake
        2.13% memcpy_erms
      - 1.42% xfs_buf_item_committing
         - 1.44% xfs_buf_item_release
            - 0.63% xfs_buf_unlock
                 0.63% up
              0.55% xfs_buf_rele
        1.06% xfs_inode_item_format
        1.00% down_read
        0.77% up_read
        0.60% xfs_buf_item_format

You can see the atomic cmpxchg loops to update the grant heads in
xfs_log_ticket_ungrant() is now looking like the highest contended
cachelines in the fast path, and the CIL ctx rwsem is showing up on
the profile, too. But there's no spin lock contention, and the
atomics I added to the xlog_cil_commit() fast path aren't having any
adverse impact on CPU usage at this point.

FWIW, the grant head accounting is definitely getting hot here,
as the xfs_trans_reserve side is showing:

     7.35%     7.29%  [kernel]            [k] xlog_grant_add_space 

Which indicates at least 15% of the CPU time on this workload is now
being spend in the grant head accounting loops. Those atomic
counters are where I'd be considering per-cpu counters now as they
are showing signs of contention at ~2.8 million updates/s each.

However, the reason I didn't use per-cpu counters way back when I
made this stuff lockless was that we need an accurate sum of the
grant head space frequently. Hence the summing cost of a per-cpu
counter is just too great for a hot limit accounting path like this,
so I used cmpxchg loops. PCP counters are still too expensive to sum
accurately regularly, so if we want to support a higher transaction
throughput rate we're going to have to get creative here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help