Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
From: Joanne Koong <hidden>
Date: 2021-09-29 01:55:01
On 9/28/21 4:54 PM, Andrii Nakryiko wrote:
On Tue, Sep 28, 2021 at 1:56 PM Joanne Koong [off-list ref] wrote:quoted
On 9/28/21 9:21 AM, Alexei Starovoitov wrote:quoted
On Mon, Sep 27, 2021 at 05:36:00PM -0700, Andrii Nakryiko wrote:quoted
On Mon, Sep 27, 2021 at 4:51 PM Alexei Starovoitov [off-list ref] wrote:quoted
On Mon, Sep 27, 2021 at 02:14:09PM -0700, Andrii Nakryiko wrote:quoted
On Mon, Sep 27, 2021 at 9:41 AM Alexei Starovoitov [off-list ref] wrote:quoted
On Fri, Sep 24, 2021 at 04:12:11PM -0700, Andrii Nakryiko wrote:quoted
That's not what I proposed. So let's say somewhere in the kernel we have this variable: static int bpf_bloom_exists = 1; Now, for bpf_map_lookup_elem() helper we pass data as key pointer. If all its hashed bits are set in Bloom filter (it "exists"), we return &bpf_bloom_exists. So it's not a NULL pointer.imo that's too much of a hack.too bad, because this feels pretty natural in BPF code: int my_key = 1234; if (bpf_map_lookup_elem(&my_bloom_filter, &my_key)) { /* handle "maybe exists" */ } else { /* handle "definitely doesn't exist" */ }To summarize this, Andrii it seems like you are proposing two possibilities for passing the ptr to the data as the key: 1. Have the value be NULL (value_size = 0)Yeah, this was my biggest hope (except value is non-NULL, it's just zero-sized so not readable/writable), but that's academic at this point, see below.quoted
2. Have the value be value_size = 1 byte or value_size = 4 bytes in a worst-case scenario For #1 where the value_size is 0 and we return something like &bpf_bloom_exists for bpf_map_lookup_elem() where the key is found, this would still require us to do the following in the syscall layer elsewhere: a) In the syscall layer in map_lookup_elem, add code that will allow value_sizes of 0. This would require another change in bpf_map_copy_value where we have to also check that if the value_size is 0, then we shouldn't copy the resulting ptr of the bpf_map_lookup_elem call to the value ptr (the value ptr isn't allocated since value_size is 0). b) In map_update_elem, add code that allows the user to pass in a NULL / zero-size value. Currently, there exists only support for passing in a NULL/zero-size key (which was added for stack/queue maps that pass in NULL keys); we'd have to add in the equivalent for NULL/zero-size values. We'd also have to modify the verifier to allow bpf_map_update_elem for NULL values (ARG_PTR_TO_UNINIT_MAP_KEY).This "UNINIT_MAP_KEY" part is confusing me because we are talking about *NULL value* (so neither uninitialized nor a key), so I must be missing something important here. I thought it would be ARG_PTR_TO_MAP_VALUE_OR_NULL, but it does suck that other map types would then be allowed NULL where they don't expect to get NULL, I agree.
You're right, this would just be changed to ARG_PTR_TO_MAP_VALUE_OR_NULL; I mis-pasted ARG_PTR_TO_UNINIT_MAP_KEY from a previous draft. Sorry if that caused confusion.
quoted
For #2, from the user-side for bpf_map_update_elem, this now means the user would have to always pass in some dummy 1-byte or 4-byte value since the value_size is no longer 0. This seems like a hacky API Repurposing peek/push/pop (in the scenario where value_size = 0) would avoid the bpf_map_copy_value change in #1a altogether, which was the primary reason for suggesting it. The approach taken in this patchset (where we have the key as NULL, and the value as the ptr to the data) avoids the need to add that infrastructure outlined above for allowing NULL values, since it just rides on top of the changes that were added to the stack/queue map that allows NULL keys.So overall, I agree that all the above will be an unnecessary complication for relatively little gain. Just go with peek/pop/push.quoted
quoted
quoted
quoted
I don't think it fits bitset map. In the bitset the value is zero or one. It always exist. If bloomfilter is not a special map and instead implemented on top of generic bitset with a plain loop in a bpf program then push -> bit_set pop -> bit_clear peek -> bit_test would be a better fit for bitset map. bpf_map_pop_elem() and peek_elem() don't have 'flags' argument. In most cases that would be a blocker, but in this case we can add: .arg3_type = ARG_ANYTHING and ignore it in case of stack/queue. While bitset could use the flags as an additional seed into the hash. So to do a bloomfilter the bpf prog would do: for (i = 0; i < 5; i++) if (bpf_map_peek_elem(map, &value, conver_i_to_seed(i)))I think I'm getting lost in the whole unified bitset + bloom filter design, tbh. In this case, why would you pass the seed to peek()? And what is value here? Is that the value (N bytes) or the bit index (4 bytes?)?In that example where seed is passed to peek(), the context is the hypothetical scenario where the bloom filter is implemented on top of a generic bitset.But then why does *bitset* do the hashing on behalf of the user, that's the confusing bit. But I'll reply to Alexei's email in just a sec.quoted
quoted
The full N byte value, of course. The pure index has the same downsides as hashing helper: - hard to make kernel and user space produce the same hash in all cases - inability to dynamically change max_entries in a clean wayquoted
I assumed that once we have a hashing helper and a bitset map, you'd use that and seed to calculate bit index. But now I'm confused about what this peek operation is doing. I'm sorry if I'm just slow. Overall, I think I agree with Joanne that a separate dedicated Bloom filter map will have simpler and better usability. This bitset + bloom filter generalization seems to just create unnecessary confusion. I don't feel the need for bitset map even more than I didn't feel the need for Bloom filter, given it's even simpler data structure and is totally implementable on either global var array or BPF_MAP_TYPE_ARRAY, if map-in-map and dynamic sizing in mandatory.Not really. For two reasons: - inner array with N 8-byte elements is a slow workaround. map_lookup is not inlined for inner arrays because max_entries will be different. - doing the same hash in user space and the kernel is hard. For example, iproute2 is using socket(AF_ALG) to compute the same hash (program tag) as the kernel. Copy-paste of kernel jhash.h is not possible due to GPL, but, as you pointed out, it's public domain, so user space would need to search a public domain, reimplement jhash and then make sure that it produces the same hash as the kernel. All these trade offs point out the need for dedicated map type (either generalized bitset or true bloomfilter) that does the hashing and can change its size.To ensure we are all aligned on this conversation, here is in more detail what I am intending for the v4 map changes: * A bitset map that also internally functions as a bloom filter if nr_hashes > 0 (where nr_hashes is denoted through the map_extra flags). max_entries will be the requested size of the bitset. Key_size should always be 0.ok, makes sense. max_entries is the number of bytes or bits? Not sure which is better (bytes is more consistent with other uses and allows for bigger bitset/filter, but bits might be more natural for bitset), just bringing this up as it's unclear.
I think number of bits would be more ergonomic for the user in the context of the bitset map. Especially since they will be operating at the granularity of the bit anyways for setting/clearing/checking a specified bit.
quoted
* Add the convenience helpers bool bpf_bitset_clear(map, value); bool bpf_bitset_set(map, value); bool bpf_bitset_test(map, value);It maps one to one to bpf_map_pop_elem(), bpf_map_push_elem(), and bpf_map_peek_elem(), right? The signatures for pop and peek are *identical* (map + value), while push has also extra flags (not a big deal, we have 0 flags in a lot of helpers). So I don't see much value for this (and it actually will be more confusing when bitset is really a bloom filter :) ).
Makes sense! I will not add these convenience helpers in v4
quoted
In the case where nr_hashes == 0 (straight bitset): * For simplicity, value_size for the bitmap should always be a u32. This denotes which index of the bitmap to set/check/clearSGTM.quoted
In the case where nr_hashes > 0 (bloom filter): * The value size can be whatever * Clear/delete is not supportedSounds good as well.