Re: Optimizing instruction-cache, more packets at each stage
From: Eric Dumazet <hidden>
Date: 2016-01-18 17:01:52
On Mon, 2016-01-18 at 12:54 +0100, Jesper Dangaard Brouer wrote:
That is very interesting. This kind of icache optimization will then likely benefit lower-end devices more than high-end Intel CPUs :-) AFAIK the Intel CPUs are masking this icache problem by having an icache prefetcher and by optimizing how fast the CPU can load/refill from higher-level caches. Intel CPUs have a lot of HW logic around this, which I assume the smaller CPUs don't. E.g. quote from the Intel Optimization Reference Manual: "The instruction fetch unit (IFU) can fetch up to 16 bytes of aligned instruction bytes each cycle from the instruction cache to the instruction length decoder (ILD). The instruction queue (IQ) buffers the ILD-processed instructions and can deliver up to four instructions in one cycle to the instruction decoder."
This does not tell how many cores/threads can fetch 16 bytes per cycle. With more than 36 execution units per socket, the peak performance of a single unit does not reflect what happens when all units are busy and contend for shared resources.

If we want to properly exploit the L1 caches of each execution unit, we need to split the load in a pipeline. But the number of units depends on hardware capabilities (like L1 cache size). That is hard to code in a generic way, as the Linux kernel has to.

For example, having the same core handle both RX and TX interrupts is not the best choice, especially when TX interrupts have to call expensive callbacks to upper layers (TCP Small Queues).
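
To make the "more packets at each stage" idea from the subject line concrete, here is a minimal userspace sketch in C. It is not kernel code: struct pkt, stage_rx(), stage_classify(), BURST_MAX and both process_*() functions are hypothetical names invented for this illustration, and the two stages are toy stand-ins for real driver and upper-layer work. It only contrasts the two loop orderings: one packet through all stages, versus a whole burst through one stage at a time so that each stage's instructions stay hot in the icache while the burst is processed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BURST_MAX 64 /* hypothetical burst limit, not a kernel constant */

/* Hypothetical packet descriptor used only for this sketch. */
struct pkt {
	uint32_t len;     /* payload length in bytes */
	uint32_t csum;    /* filled in by stage 1 */
	int      verdict; /* filled in by stage 2: 1 = accept, 0 = drop */
};

/* Stage 1: toy hash of the length, standing in for driver-level RX work. */
static void stage_rx(struct pkt *p)
{
	p->csum = p->len * 2654435761u;
}

/* Stage 2: toy classification, standing in for upper-layer processing. */
static void stage_classify(struct pkt *p)
{
	p->verdict = (p->len >= 64);
}

/* Per-packet ordering: each packet runs through every stage before the
 * next packet starts, so the code of all stages competes for the icache. */
static size_t process_per_packet(struct pkt *burst, size_t n)
{
	size_t accepted = 0;

	for (size_t i = 0; i < n; i++) {
		stage_rx(&burst[i]);
		stage_classify(&burst[i]);
		accepted += (size_t)burst[i].verdict;
	}
	return accepted;
}

/* Stage-batched ordering: the whole burst runs through stage 1, then the
 * whole burst runs through stage 2. Results are identical; only the
 * instruction stream seen by the core differs. */
static size_t process_batched(struct pkt *burst, size_t n)
{
	size_t accepted = 0;

	assert(n <= BURST_MAX);

	for (size_t i = 0; i < n; i++)
		stage_rx(&burst[i]);

	for (size_t i = 0; i < n; i++) {
		stage_classify(&burst[i]);
		accepted += (size_t)burst[i].verdict;
	}
	return accepted;
}
```

Both functions produce the same per-packet results. Whether the batched ordering is actually faster depends on how large the real stages are relative to the L1 icache and on how busy the other execution units are, which is exactly the hardware dependence that makes this hard to tune generically.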