Thread: 59 messages, 12 authors, 2016-02-02

Re: Optimizing instruction-cache, more packets at each stage

From: Eric Dumazet <hidden>
Date: 2016-01-18 17:01:52

On Mon, 2016-01-18 at 12:54 +0100, Jesper Dangaard Brouer wrote:
That is very interesting. This kind of icache optimization will then
likely benefit lower-end devices more than high-end Intel CPUs :-)

AFAIK the Intel CPUs are masking this icache problem, by having an icache
prefetcher and optimizing how fast the CPU can load/refill from higher
level caches.  Intel CPUs have a lot of HW-logic around this, which
I assume the smaller CPUs don't.  E.g. quote from Intel Optimization
Reference Manual:

 "The instruction fetch unit (IFU) can fetch up to 16 bytes of aligned
  instruction bytes each cycle from the instruction cache to the
  instruction length decoder (ILD). The instruction queue (IQ) buffers
  the ILD-processed instructions and can deliver up to four instructions
  in one cycle to the instruction decoder."

This does not say how many cores/threads can fetch 16 bytes per cycle.

With more than 36 execution units per socket, the peak performance of a
single unit does not reflect what happens when all units are busy and
contend for shared resources.

If we want to properly exploit the L1 caches of each execution unit, we
need to split the load into a pipeline. But the number of units depends
on hardware capabilities (like L1 cache size), which is hard to code in
a generic way in the Linux kernel.

For example, having the same core handle both RX and TX interrupts is
not the best choice, especially when TX interrupts have to call
expensive callbacks to upper layers (TCP Small Queues).
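
As a rough illustration of the RX/TX split, the sketch below computes
the values one could write to /proc/irq/<N>/smp_affinity so that the RX
and TX interrupts land on different cores. The IRQ numbers (40, 41) and
the helper names are made up for the example; it also assumes the
driver exposes separate RX and TX vectors, which is not the case for
NICs that use combined queue interrupts. The code only prints the
commands and does not touch /proc.

```python
def cpu_mask(cpu):
    """Return the smp_affinity hex mask that selects a single CPU.

    Masks wider than 32 bits are written as comma-separated 32-bit
    groups, most significant group first.
    """
    if cpu < 0:
        raise ValueError("CPU index must be non-negative")
    digits = format(1 << cpu, "x")
    # Pad to a whole number of 8-hex-digit (32-bit) groups.
    digits = "0" * (-len(digits) % 8) + digits
    groups = [digits[i:i + 8] for i in range(0, len(digits), 8)]
    # Leading zeros of the first group are optional; drop them.
    groups[0] = groups[0].lstrip("0") or "0"
    return ",".join(groups)


def plan_irq_affinity(rx_irq, tx_irq, rx_cpu, tx_cpu):
    """Map each IRQ's smp_affinity path to a single-CPU mask.

    Hypothetical helper: it only builds the plan, it writes nothing.
    """
    if rx_irq == tx_irq:
        raise ValueError("RX and TX share one IRQ, cannot split them")
    if rx_cpu == tx_cpu:
        raise ValueError("RX and TX would still share a core")
    return {
        f"/proc/irq/{rx_irq}/smp_affinity": cpu_mask(rx_cpu),
        f"/proc/irq/{tx_irq}/smp_affinity": cpu_mask(tx_cpu),
    }


if __name__ == "__main__":
    # IRQ numbers 40 and 41 are placeholders; look up the real ones
    # in /proc/interrupts for the NIC in question.
    plan = plan_irq_affinity(rx_irq=40, tx_irq=41, rx_cpu=0, tx_cpu=1)
    for path, mask in plan.items():
        print(f"echo {mask} > {path}")
```

Whether such a split actually helps depends on the workload and on
which cores share caches, so it should be measured rather than assumed.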