Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
From: Andrii Nakryiko <hidden>
Date: 2022-04-13 04:10:23
Also in:
bpf
On Fri, Apr 8, 2022 at 8:08 PM Hou Tao [off-list ref] wrote:
Hi, On 4/7/2022 1:38 AM, Andrii Nakryiko wrote:quoted
On Thu, Mar 31, 2022 at 5:04 AM Hou Tao [off-list ref] wrote:quoted
Hi, The initial motivation for the patchset is due to the suggestion of Alexei. During the discuss of supporting of string key in hash-table, he saw the space efficiency of ternary search tree under our early test and suggest us to post it as a new bpf map [1]. Ternary search tree is a special trie where nodes are arranged in a manner similar to binary search tree, but with up to three children rather than two. The three children correpond to nodes whose value is less than, equal to, and greater than the value of current node respectively. In ternary search tree map, only the valid content of string is saved. The trailing null byte and unused bytes after it are not saved. If there are common prefixes between these strings, the prefix is only saved once. Compared with other space optimized trie (e.g. HAT-trie, succinct trie), the advantage of ternary search tree is simple and being writeable. Below are diagrams for ternary search map when inserting hello, he, test and tea into it: 1. insert "hello" [ hello ] 2. insert "he": need split "hello" into "he" and "llo" [ he ] | * | [ llo ] 3. insert "test": add it as right child of "he" [ he ] | *-------x | | [ llo ] [ test ] 5. insert "tea": split "test" into "te" and "st", and insert "a" as left child of "st" [ he ] | x------*-------x | | | [ ah ] [ llo ] [ te ] | * | [ st ] | x----* | [ a ] As showed in above diagrams, the common prefix between "test" and "tea" is "te" and it only is saved once. Also add benchmarks to compare the memory usage and lookup performance between ternary search tree and hash table. When the common prefix is lengthy (~192 bytes) and the length of suffix is about 64 bytes, there are about 2~3 folds memory saving compared with hash table. But the memory saving comes at prices: the lookup performance of tst is about 2~3 slower compared with hash table. See more benchmark details on patch #2. Comments and suggestions are always welcome.Have you heard and tried qp-trie ([0]) by any chance? It is elegant and simple data structure. By all the available benchmarks it handily beats Red-Black trees in terms of memory usage and performance (though it of course depends on the data set, just like "memory compression" for ternary tree of yours depends on large set of common prefixes). qp-trie based BPF map seems (at least on paper) like a better general-purpose BPF map that is dynamically sized (avoiding current HASHMAP limitations) and stores keys in sorted order (and thus allows meaningful ordered iteration *and*, importantly for longest prefix match tree, allows efficient prefix matches). I did a quick experiment about a month ago trying to replace libbpf's internal use of hashmap with qp-trie for BTF string dedup and it was slightly slower than hashmap (not surprisingly, though, because libbpf over-sizes hashmap to avoid hash collisions and long chains in buckets), but it was still very decent even in that scenario. So I've been mulling the idea of implementing BPF map based on qp-trie elegant design and ideas, but can't find time to do this.I have heard about it when check the space efficient of HAT trie [0], because qp-trie needs to save the whole string key in the leaf node and its space efficiency can not be better than ternary search tree for strings with common prefix, so I did not consider about it. But I will do some benchmarks to check the lookup performance and space efficiency of qp-trie and tst for string with common prefix and strings without much common prefix. If qp-trie is better, I think I can take the time to post it as a bpf map if you are OK with that.
You can probably always craft a data set where prefix sharing is so prevalent that space savings are very significant. But I think for a lot of real-world data it won't be as extreme and qp-trie might be very comparable (if not more memory-efficient) due to very compact node layout (which was the point of qp-trie). So I'd be really curious to see some comparisons. Would be great if you can try both!
quoted
This prefix sharing is nice when you have a lot of long common prefixes, but I'm a bit skeptical that as a general-purpose BPF data structure it's going to be that beneficial. 192 bytes of common prefixes seems like a very unusual dataset :)Yes. The case with common prefix I known is full file path.quoted
More specifically about TST implementation in your paches. One global per-map lock I think is a very big downside. We have LPM trie which is very slow in big part due to global lock. It might be possible to design more granular schema for TST, but this whole in-place splitting logic makes this harder. I think qp-trie can be locked in a granular fashion much more easily by having a "hand over hand" locking: lock parent, find child, lock child, unlock parent, move into child node. Something like that would be more scalable overall, especially if the access pattern is not focused on a narrow set of nodes.Yes. The global lock is a problem but the splitting is not in-place. I will try to figure out whether the lock can be more scalable after the benchmark test between qp-trie and tst.
Great, looking forward!
Regards, Tao [0]: https://github.com/Tessil/hat-triequoted
Anyways, I love data structures and this one is an interesting idea. But just my few cents of "production-readiness" for general-purpose data structures for BPF. [0] https://dotat.at/prog/qp/README.htmlquoted
Regards, Tao [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/ (local) Hou Tao (2): bpf: Introduce ternary search tree for string key selftests/bpf: add benchmark for ternary search tree map include/linux/bpf_types.h | 1 + include/uapi/linux/bpf.h | 1 + kernel/bpf/Makefile | 1 + kernel/bpf/bpf_tst.c | 411 +++++++++++++++++ tools/include/uapi/linux/bpf.h | 1 + tools/testing/selftests/bpf/Makefile | 5 +- tools/testing/selftests/bpf/bench.c | 6 + .../selftests/bpf/benchs/bench_tst_map.c | 415 ++++++++++++++++++ .../selftests/bpf/benchs/run_bench_tst.sh | 54 +++ tools/testing/selftests/bpf/progs/tst_bench.c | 70 +++ 10 files changed, 964 insertions(+), 1 deletion(-) create mode 100644 kernel/bpf/bpf_tst.c create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c -- 2.31.1.