Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key

From: Andrii Nakryiko <hidden>
Date: 2022-07-05 22:37:17
Also in: bpf

On Wed, Jun 8, 2022 at 2:00 AM Hou Tao [off-list ref] wrote:

Hi,

On 4/27/2022 11:57 AM, Andrii Nakryiko wrote:

quoted

On Tue, Apr 26, 2022 at 1:03 AM Hou Tao [off-list ref] wrote:

snip

quoted

I'm biased :) But I like the idea of qp-trie as a general purpose
ordered and dynamically sized BPF map. It makes no assumption about
data being string-like and sharing common prefixes. It can be made to
work just as fine with any array of bytes, making it very suitable as
a generic lookup table map. Note that upstream implementation does
assume zero-terminated strings and no key being a prefix of another
key. But all that can be removed. For fixed-length keys this can never
happen by construction, for variable-length keys (and we'll be able to
support this finally with bpf_dynptr's help very soon), we can record
length of the key in each leaf and use that during comparisons.

Using the trailing zero byte to make sure no key will be a prefix of another is
simple, but I will check whether or not there is other way to make the bytes
array key work out. Alexei had suggest me to use the length of key as part of
key just like bpf_lpm_trie_key does, maybe i can try it first.

Yeah, using key length as part of the key during comparison is what I
meant as well. I didn't mean to aritificially add trailing zero (this
idea doesn't work for arbitrary binary data).

quoted

Also note that qp-trie can be internally used by BPF_MAP_TYPE_LPM_TRIE
very efficiently and speed it up considerable in the process (and
especially to get rid of the global lock).

So if you were to invest in a proper full-featured production
implementation of a BPF map, I'd start with qp-trie. From available
benchmarks it's both faster and more memory efficient than Red-Black
trees, which could be an alternative underlying implementation of such
ordered and "resizable" map.

Recently I tried to add concurrent update and deletion support for qp-trie, and
found out that hand-over-hand lock scheme may don't work for qp-trie update. The
short explanation is that update procedure needs traverse qp-trie twice and the
position tuple got in the first pass may be stale due to concurrent updates
occurred in the second pass. The detailed explanation is shown below.

To reduce space usage for qp-trie, there is no common prefix saved in branch
node, so the update of qp-trie needs to traversing qp-trie to find the most
similar key firstly, comparing with it to get a (index, flag,) tuple for the
leaf key, then traversing qp-trie again to find the insert position by using the
tuple. The problem is that, the position tuple may be stale due to concurrent
updates occurred in different branch. Considering the following case:

When inserting "aa_bind_mount" and "aa_buffers_lock" concurrently into the
following qp-trie. The most similar key for "aa_bind_mount" is leaf X, and for
"aa_buffers_lock" it is leaf Y. The calculated index tuple for both new keys are
the same: (index=3, flag=2). Assuming "aa_bind_mount" is inserted firstly, so
when inserting "aa_buffers_lock", the correct index will be (index=4, flag=1)
instead of (index=3, flag=2) and the result will be incorrect.

branch: index  1 flags 1 bitmap 0x00088
* leaf: a.81577 0
* branch: index  4 flags 1 bitmap 0x00180
* * branch: index  4 flags 2 bitmap 0x02080
* * * leaf: aa_af_perm 1 (leaf X, for aa_bind_mount)
* * branch: index  4 flags 2 bitmap 0x00052
* * * leaf: aa_apply_modes_to_perms 6
* * * leaf: aa_asprint_hashstr 7 (leaf Y, for aa_buffers_lock)

In order to get a correct position tuple, the intuitive solution is adding
common prefix in branch node and letting update procedure to find the insertion
position by comparing with the prefix, so it only needs to traverse qp-trie once
and hand-over-hand locking scheme can work. I plan to work on qp-trie with
common prefix and will update its memory usage and concurrent update/insert
speed in this thread once the demo is ready.  So any more suggestions ?

Yeah, that sucks. I'm not sure I completely understand the common
prefix solution and whether it will still be qp-trie after that, but
if you try that, it would be interesting to learn about the results
you get!

I think in the worst case we'll have to do tree-wide lock, perhaps
maybe having an initial 256-way root node for first byte, and each of
256 subtrees could have their own lock.

Alternatively we can do optimistic lockless lookup (in the hope to
find a match), but if that fails - take tree-wide lock, and perform
the search and insertion again. This will favor lookup hits,
obviously, but hopefully that's going to be a common use case where
keys are mostly matching.

WDYT?

Regards,
Tao

quoted

Thanks for your suggestions. I will give it a try.

Awesome!

quoted

Regards,
Tao

quoted

Regards,
Tao

quoted

This prefix sharing is nice when you have a lot of long common
prefixes, but I'm a bit skeptical that as a general-purpose BPF data
structure it's going to be that beneficial. 192 bytes of common
prefixes seems like a very unusual dataset :)

Yes. The case with common prefix I known is full file path.

quoted

More specifically about TST implementation in your paches. One global
per-map lock I think is a very big downside. We have LPM trie which is
very slow in big part due to global lock. It might be possible to
design more granular schema for TST, but this whole in-place splitting
logic makes this harder. I think qp-trie can be locked in a granular
fashion much more easily by having a "hand over hand" locking: lock
parent, find child, lock child, unlock parent, move into child node.
Something like that would be more scalable overall, especially if the
access pattern is not focused on a narrow set of nodes.

Yes. The global lock is a problem but the splitting is not in-place. I will try
to figure out whether the lock can be more scalable after the benchmark test
between qp-trie and tst.

Great, looking forward!

quoted

Regards,
Tao

[0]: https://github.com/Tessil/hat-trie

quoted

Anyways, I love data structures and this one is an interesting idea.
But just my few cents of "production-readiness" for general-purpose
data structures for BPF.

  [0] https://dotat.at/prog/qp/README.html

quoted

Regards,
Tao

[1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/ (local)

Hou Tao (2):
  bpf: Introduce ternary search tree for string key
  selftests/bpf: add benchmark for ternary search tree map

 include/linux/bpf_types.h                     |   1 +
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/Makefile                           |   1 +
 kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
 tools/include/uapi/linux/bpf.h                |   1 +
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bench.c           |   6 +
 .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
 .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
 tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
 10 files changed, 964 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/bpf_tst.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
 create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c

--
2.31.1

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help