Re: [PATCH bpf-next v2 4/4] bpf: inet_diag: Dump bpf_sk_storages in inet_diag_dump()
From: Alexei Starovoitov <hidden>
Date: 2020-02-28 02:58:08
Also in:
bpf
On Thu, Feb 27, 2020 at 3:45 PM Daniel Borkmann [off-list ref] wrote:
On 2/27/20 2:34 AM, Martin KaFai Lau wrote:quoted
On Wed, Feb 26, 2020 at 06:21:33PM +0100, Daniel Borkmann wrote:quoted
On 2/26/20 12:04 AM, Martin KaFai Lau wrote:quoted
This patch will dump out the bpf_sk_storages of a sk if the request has the INET_DIAG_REQ_SK_BPF_STORAGES nlattr. An array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD can be specified in INET_DIAG_REQ_SK_BPF_STORAGES to select which bpf_sk_storage to dump. If no map_fd is specified, all bpf_sk_storages of a sk will be dumped. bpf_sk_storages can be added to the system at runtime. It is difficult to find a proper static value for cb->min_dump_alloc. This patch learns the nlattr size required to dump the bpf_sk_storages of a sk. If it happens to be the very first nlmsg of a dump and it cannot fit the needed bpf_sk_storages, it will try to expand the skb by "pskb_expand_head()". Instead of expanding it in inet_sk_diag_fill(), it is expanded at a sleepable context in __inet_diag_dump() so __GFP_DIRECT_RECLAIM can be used. In __inet_diag_dump(), it will retry as long as the skb is empty and the cb->min_dump_alloc becomes larger than before. cb->min_dump_alloc is bounded by KMALLOC_MAX_SIZE. The min_dump_alloc is also changed from 'u16' to 'u32' to accommodate a sk that may have a few large bpf_sk_storages. The updated cb->min_dump_alloc will also be used to allocate the skb in the next dump. This logic already exists in netlink_dump(). Here is the sample output of a locally modified 'ss' and it could be made more readable by using BTF later: [root@arch-fb-vm1 ~]# ss --bpf-map-id 14 --bpf-map-id 13 -t6an 'dst [::1]:8989' State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess ESTAB 0 0 [::1]:51072 [::1]:8989 bpf_map_id:14 value:[ 3feb ] bpf_map_id:13 value:[ 3f ] ESTAB 0 0 [::1]:51070 [::1]:8989 bpf_map_id:14 value:[ 3feb ] bpf_map_id:13 value:[ 3f ] [root@arch-fb-vm1 ~]# ~/devshare/github/iproute2/misc/ss --bpf-maps -t6an 'dst [::1]:8989' State Recv-Q Send-Q Local Address:Port Peer Address:Port Process ESTAB 0 0 [::1]:51072 [::1]:8989 bpf_map_id:14 value:[ 3feb ] bpf_map_id:13 value:[ 3f ] bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ] ESTAB 0 0 [::1]:51070 [::1]:8989 bpf_map_id:14 value:[ 3feb ] bpf_map_id:13 value:[ 3f ] bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ] Acked-by: Song Liu <redacted> Signed-off-by: Martin KaFai Lau <redacted>Hmm, the whole approach is not too pleasant to be honest. I can see why you need it since the regular sk_storage lookup only takes sock fd as a key and you don't have it otherwise available from outside, but then dumping up to KMALLOC_MAX_SIZE via netlink skb is not a great experience either. :( Also, are we planning to add the BTF dump there in addition to bpftool? Thus resulting in two different lookup APIs and several tools needed for introspection instead of one? :/ Also, how do we dump task local storage maps in future? Does it need a third lookup interface? In your commit logs I haven't read on other approaches and why they won't work; I was wondering, given sockets are backed by inodes, couldn't we have a variant of iget_locked() (minus the alloc_inode() part from there) where you pass in ino number to eventually get to the socket and then dump the map value associated with it the regular way from bpf() syscall?Thanks for the feedback! I think (1) dumping all sk(s) in a system is different from (2) dumping all sk of a bpf_sk_storage_map or lookup a particular sk from a bpf_sk_storage_map.Yeah, it is; I was mostly brain-storming if there is a cleaner way for (1) by having (2)b resolved as an intermediate step (1) can then build on, but seems it's tricky w/o much extra infra. [...]quoted
or the netlink can return map_id only when the max-sized skb cannot fit all the bpf_sk_storages. The userspace then do another syscall to lookup the data from each individual bpf_sk_storage_map and that requires to lookup side support with another key (non-fd). IMO, it is weird and a bit opposite of what bpf_sk_storage should be (fast bpf_sk_storage lookup while holding a sk). The iteration API already holds the sk but instead it is asking the usespace to go back to find out the sk again in order to get the bpf_sk_storages. I think that should be avoided if possible. Regarding i_ino, after looking at sock_alloc() and get_next_ino(), hmmm...is it unique?It would wrap around after 2^32 allocations, so scratch that thought, iget_locked()/find_inode_fast()-based lookup usage must ensure it's unique.quoted
If it is, what is the different usecase between i_ino and sk->sk_cookie?Agree that advantage of reusing diag is that you already hold the sk and are able to iterate through all of them, the ugly part is having to place the value data into netlink as an API along with all other socket data as a, for better or worse, bpf sk map lookup/introspection interface given there is no other way to have a fast and global (non-fd based) id->socket lookup interface that we could reuse atm. I was wondering about sk->sk_cookie as well, but it wouldn't make sense to do a ss dump, get (sk cookie, map_id) from the dump and use that for a bpf() lookup if we need to reiterate the diag tables once again in the background just to get the storage data. And I presume it won't make sense either to reuse the diag's walk as a stand-alone for a bpf sk storage dump interface API via bpf() ... at least from ss tool side it would require a correlation based on sk cookie. Not nice either ... so current approach might indeed be the tradeoff. :/
I agree with all the points above. I've applied this set. I think making ss show it via standard way is necessary. We should continue thinking on other options.