Thread (47 messages) 47 messages, 9 authors, 2019-03-05

Re: [PATCH bpf-next v2 1/7] bpf: implement lookup-free direct value access

From: Yonghong Song <hidden>
Date: 2019-03-01 23:04:58
Also in: bpf


On 3/1/19 11:51 AM, Daniel Borkmann wrote:
On 03/01/2019 06:18 PM, Yonghong Song wrote:
quoted
On 2/28/19 3:18 PM, Daniel Borkmann wrote:
quoted
This generic extension to BPF maps allows for directly loading an
address residing inside a BPF map value as a single BPF ldimm64
instruction.

The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
is a special src_reg flag for ldimm64 instruction that indicates
that inside the first part of the double insns's imm field is a
file descriptor which the verifier then replaces as a full 64bit
address of the map into both imm parts.

For the newly added BPF_PSEUDO_MAP_VALUE src_reg flag, the idea
is similar: the first part of the double insns's imm field is
again a file descriptor corresponding to the map, and the second
part of the imm field is an offset. The verifier will then replace
both imm parts with an address that points into the BPF map value
for maps that support this operation. BPF_PSEUDO_MAP_VALUE is a
distinct flag as otherwise with BPF_PSEUDO_MAP_FD we could not
differ offset 0 between load of map pointer versus load of map's
value at offset 0.

This allows for efficiently retrieving an address to a map value
memory area without having to issue a helper call which needs to
prepare registers according to calling convention, etc, without
needing the extra NULL test, and without having to add the offset
in an additional instruction to the value base pointer.

The verifier then treats the destination register as PTR_TO_MAP_VALUE
with constant reg->off from the user passed offset from the second
imm field, and guarantees that this is within bounds of the map
value. Any subsequent operations are normally treated as typical
map value handling without anything else needed for verification.

The two map operations for direct value access have been added to
array map for now. In future other types could be supported as
well depending on the use case. The main use case for this commit
is to allow for BPF loader support for global variables that
reside in .data/.rodata/.bss sections such that we can directly
load the address of them with minimal additional infrastructure
required. Loader support has been added in subsequent commits for
libbpf library.
The patch version #1 provides a way to replace the load with
immediate (presumably read-only data). This will be good for
the use case like below:

     if (static_variable_kernel_version == V1) {
         /* code here will work for kernel V1 */
         ... access helpers available for V1 ...
     } else if (static_variable_kernel_version == V2) {
         /* code here will work for kernel V2 */
         ... access helpers available for V2 ...
     }

The approach here did not replace the map value access with values from
e.g., readonly section for which libbpf could provide an interface to
fill in data from user.

This may require a little more analysis, e.g.,
     ptr = ld_imm64 from a readonly section
     ...
     *(u32 *)ptr;
     *(u64 *)(ptr + 8);
     ...

Do you think we could do this in kernel verifier or we should
push the whole readonly stuff into user space?
And in your case the static_variable_kernel_version would be determined
at runtime, for example, where you then would want to eliminate all the
other branches, right? Meaning, you'd need a way to turn this into a imm
load such that verifier will detect these dead branches and patch them
Yes, the program will be compiled once and deployed to many hosts with 
different kernel versions. Different hosts may have different kernel
versions. The static_variable_kernel_version is determined
out, which it should already be able to do. How would you mark these
special vars like static_variable_kernel_version such that they have
special treatment from the rest, some sort of builtin? Potentially one
A libbpf API is needed to assign a particular value to a readonly 
section value. For example, a bpf program may look like:

-bash-4.4$ cat g1.c
static volatile const unsigned __kernel_version;
int prog() {
   unsigned kernel_ver = __kernel_version;

   if (kernel_ver == 411)
     return 0;
   else if (kernel_ver == 416)
     return 1;
   return 2;
}
-bash-4.4$ clang -target bpf -O2 -c g1.c 

-bash-4.4$ llvm-readelf -r g1.o

Relocation section '.rel.text' at offset 0x178 contains 1 entries:
     Offset             Info             Type               Symbol's 
Value  Symbol's Name
0000000000000000  0000000500000001 R_BPF_64_64 
0000000000000000 .rodata
-bash-4.4$ llvm-objdump -d g1.o 


g1.o:   file format ELF64-BPF

Disassembly of section .text:
0000000000000000 prog:
        0:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
r1 = 0 ll
        2:       61 11 00 00 00 00 00 00         r1 = *(u32 *)(r1 + 0)
        3:       b7 02 00 00 01 00 00 00         r2 = 1
        4:       15 01 01 00 a0 01 00 00         if r1 == 416 goto +1 
<LBB0_2>
        5:       b7 02 00 00 02 00 00 00         r2 = 2

0000000000000030 LBB0_2:
        6:       b7 00 00 00 00 00 00 00         r0 = 0
        7:       15 01 01 00 9b 01 00 00         if r1 == 411 goto +1 
<LBB0_4>
        8:       bf 20 00 00 00 00 00 00         r0 = r2

0000000000000048 LBB0_4:
        9:       95 00 00 00 00 00 00 00         exit
-bash-4.4$ llvm-readelf -S g1.o
There are 9 section headers, starting at offset 0x1f8:

Section Headers:
   [Nr] Name              Type            Address          Off    Size 
ES Flg Lk Inf Al
   [ 0]                   NULL            0000000000000000 000000 000000 
00      0   0  0
   [ 1] .strtab           STRTAB          0000000000000000 000189 000068 
00      0   0  1
   [ 2] .text             PROGBITS        0000000000000000 000040 000050 
00  AX  0   0  8
   [ 3] .rel.text         REL             0000000000000000 000178 000010 
10      8   2  8
   [ 4] .rodata           PROGBITS        0000000000000000 000090 000004 
00   A  0   0  4
   [ 5] .BTF              PROGBITS        0000000000000000 000094 000019 
00      0   0  1
   [ 6] .BTF.ext          PROGBITS        0000000000000000 0000ad 000020 
00      0   0  1
   [ 7] .llvm_addrsig     LLVM_ADDRSIG    0000000000000000 000188 000001 
00   E  8   0  1
   [ 8] .symtab           SYMTAB          0000000000000000 0000d0 0000a8 
18      1   6  8
Key to Flags:
   W (write), A (alloc), X (execute), M (merge), S (strings), l (large)
   I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
   O (extra OS processing required) o (OS specific), p (processor specific)
-bash-4.4$ llvm-readelf -s g1.o 


Symbol table '.symtab' contains 7 entries:
    Num:    Value          Size Type    Bind   Vis      Ndx Name
      0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
      1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS g1.c
      2: 0000000000000030     0 NOTYPE  LOCAL  DEFAULT    2 LBB0_2
      3: 0000000000000048     0 NOTYPE  LOCAL  DEFAULT    2 LBB0_4
      4: 0000000000000000     4 OBJECT  LOCAL  DEFAULT    4 __kernel_version
      5: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 .rodata
      6: 0000000000000000    80 FUNC    GLOBAL DEFAULT    2 prog
-bash-4.4$

The relocation is for the first insn.
The address is the start of .rodata section, which happens to
match variable __kernel_version (size 4).

The libbpf API can provide a way for user to assign a value to readonly 
section. In this particular case, e.g., on HostA, __kernel_version is
assigned to 416, which means the first 4 bytes of .rodata is modified
to have value 416. Considering this is a generic interface, the API
may look like
   bpf_object__change_readonly_value(const char *var_name, void *val_buf,
     unsigned var_buf_size);
The libbpf will change the value if there is a "var_name" in rodata
section and val_buf_size matches the size in the symbol table.
could get away with doing this from loader side if it's simple enough,
though one thing that would be good to avoid is to duplicate all the
complex branch fixup logic etc that we have in kernel already. Are you
I totally agree that kernel is already able to prune dead codes while 
maintaining correct func/line info. We should do that part in kernel.

Let us look at the byte codes,

0000000000000000 prog:
        0:       r1 = 0 ll
        2:       r1 = *(u32 *)(r1 + 0)
        3:       r2 = 1
        4:       if r1 == 416 goto +1 <LBB0_2>
        5:       r2 = 2

0000000000000030 LBB0_2:
        6:       r0 = 0
        7:       if r1 == 411 goto +1 <LBB0_4>
        8:       r0 = r2

0000000000000048 LBB0_4:
        9:       exit

Here, the goal is to let r1 at insn #2 get the constant.
Do you think we can get it from the kernel? In this particular case,
insn #0, get a romap_ptr with addr of rodata section offset 0,
insn #2, load u32 from romap offset 0, the value is already populated, 
e.g., 416.

The verifier is path sensitive, will need extra care to
perform such transformation in case it is invalid in different paths.
Maybe slightly extension of verifier is able to do this?
Initially we do not need to handle complicated cases. Most global/static
variable accesses are all like
    r1 = #num ll
    r1 = *(type *)(r1 + offset)
If there is branch into the middle of the above pair of insns
and r1 is romap_ptr, it is totally safe to replace the second insn
as r1 = constant which can enable later dead code elimination.
If all read only region access are converted to constants,
"r1 = #num ll" ld_imm64 insns can be removed as well.
thinking to mark these via BTF in some way such that loader does inline
replacement?
I have not thought about BTF. BTF could provide information about insn 
#2 referring to a particular readonly section location. But looks like 
verifier is able to track it as well in the above?

Let us first study whether without BTF is okay. If needed, we can go
through BTF path with compiler assistance.
Thanks,
Daniel
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help