L_PTE_MT_BUFFERABLE / device ordered memory
From: Prof. Michael Taylor <hidden>
Date: 2022-12-30 22:25:45
Hi,

Apologies in advance if I have missed an age-old thread on this, and apologies for the length of the description.

I am trying to tune the memory-mapped I/O performance of a ZYNQ 7000 with an ARM A7 core running Linux. From what I can observe (in the phys_mem_access_prot function in mmu.c), the default for a memory range that has not been given in the device tree is "strongly ordered", which means that the ZYNQ core will not proceed to the next such memory request until the previous one has fully completed. This performs poorly, with an average overhead of 24 cycles per access. I believe this corresponds to the setting pgprot_noncached (and then to L_PTE_MT_UNCACHED) in the kernel.

The ARM architecture, however, provides another setting in the page table entry, "device ordering", which maintains the ordering and number of requests going out to the device without pausing the ARM core. Various Xilinx forum posts confirm that, in the bare-metal OS option, setting the ARM page table TEX, C and B fields to 000, 0 and 1 respectively greatly improves performance (to maybe 4 cycles per access).

Q1. My goal is to unlock this functionality in the Linux kernel. Any best practices? (Below is what I tried/figured out.)

Looking at the phys_mem_access_prot function, I concluded that perhaps I should map in the memory location using the device tree, as reserved, and that this would cause phys_mem_access_prot to select pgprot_writecombine in the kernel. After doing this successfully, I noticed a great improvement in performance, but also that only a small fraction of the transactions in my test case were actually making it out to the I/O device. The test case was writing a series of zeros to the same I/O address, which corresponds to a FIFO, so I really want to see all of the zeros.
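For concreteness, the test case can be sketched roughly as follows. This is only an illustration under assumptions: the helper names map_io_page and write_zeros are hypothetical, the mapping is assumed to go through /dev/mem, and the physical address of the FIFO is left to the caller. Note that volatile only stops the compiler from merging the stores; it has no influence on what the memory system does with them once the page is mapped with Normal memory attributes.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one page of physical address space through
 * /dev/mem. phys must be page aligned. Returns NULL on failure. */
volatile uint32_t *map_io_page(off_t phys)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = open("/dev/mem", O_RDWR);
    void *p;

    if (fd < 0 || page <= 0) {
        if (fd >= 0)
            close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Hypothetical helper: issue `count` 32-bit zero stores to a single
 * register address. Returns the number of stores issued by the program. */
size_t write_zeros(volatile uint32_t *reg, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        *reg = 0;
    return i;
}
```

Comparing the value returned by write_zeros with the number of write transactions seen on the bus is the check described above.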
Looking at the logic analyzer, I saw that the processor was optimizing away the repeated zero writes, and that the AWCACHE field on the AXI bus was set to 3. This was quite surprising to me, as this value suggests that the PTE is, per the ARM docs (https://developer.arm.com/documentation/ihi0022/c/Additional-Control-Information/Cache-support), cacheable and bufferable, rather than just bufferable.

Diving deeper into the kernel, I see that in proc-macros.S, in armv6_mt_table, the L_PTE_MT_BUFFERABLE entry is set to PTE_EXT_TEX(1) (i.e. TEX,C,B = 001,0,0), which per https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Protected-Memory-System-Architecture--PMSA-/Memory-region-attributes/C--B--and-TEX-2-0--encodings is listed as "Normal memory", but with the outer and inner regions given as non-cacheable. I would have expected PTE_BUFFERABLE (i.e. TEX,C,B = 000,0,1).

Also, looking at proc-v7-2level.S, I see that BUFFERABLE is defined as TR=10, IR=00, OR=00, where TR, the memory type (per https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/System-Control-Registers-in-a-VMSA-implementation/VMSA-System-control-registers-descriptions--in-register-order/PRRR--Primary-Region-Remap-Register--VMSA?lang=en), is defined as 00=strongly-ordered, 01=device, 10=normal memory. So I would have expected 01=device memory.

So my conclusion is that pgprot_writecombine is not what I am looking for, since not only does it buffer and combine writes into packets, it also eliminates writes to the same address.

Q2. What is the history behind using strong ordering instead of device ordering for I/O writes? And why does the write-combining setting map to "Normal Memory" rather than device memory? And why does mmu.c not provide a mechanism for accessing device ordering (or does it)?

Thanks!
Michael
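P.S. For reference, the encodings quoted in this message can be written down as a small lookup. This is a sketch only: the function names are hypothetical, the first table assumes TEX remap is disabled, it covers just the three (TEX, C, B) triples discussed here, and it is not kernel code.

```c
#include <stdio.h>

/* Hypothetical helper: memory type for a (TEX, C, B) triple, assuming TEX
 * remap is disabled. Only the three encodings discussed in this message
 * are listed; everything else falls through to the default. */
const char *tex_cb_type(unsigned tex, unsigned c, unsigned b)
{
    unsigned enc = ((tex & 7u) << 2) | ((c & 1u) << 1) | (b & 1u);

    switch (enc) {
    case 0x0: return "Strongly-ordered";                      /* 000,0,0 */
    case 0x1: return "Shareable Device";                      /* 000,0,1 */
    case 0x4: return "Normal, Outer and Inner Non-cacheable"; /* 001,0,0 */
    default:  return "not covered here";
    }
}

/* Hypothetical helper: memory type selected by a 2-bit PRRR TRn field,
 * using the values quoted from the PRRR description. */
const char *prrr_tr_type(unsigned tr)
{
    switch (tr & 3u) {
    case 0:  return "Strongly-ordered";
    case 1:  return "Device";
    case 2:  return "Normal";
    default: return "Reserved";
    }
}

/* Print the entry the kernel table uses next to the one I expected. */
void print_comparison(void)
{
    printf("PTE_EXT_TEX(1)      -> %s\n", tex_cb_type(1, 0, 0));
    printf("TEX,C,B = 000,0,1   -> %s\n", tex_cb_type(0, 0, 1));
    printf("BUFFERABLE TR = 10  -> %s\n", prrr_tr_type(2));
}
```

Calling print_comparison() shows the "Normal" result for the current L_PTE_MT_BUFFERABLE encoding next to the "Device" result for the encoding I expected.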