While taking some additional performance measurements I noticed some interesting plateaus. Two points are of interest: the first at ~200 bytes and the second at 600 bytes (the x axis is given in DWORDs, i.e. uint32_t elements).
Also quite interesting: how long it takes to "warm" the cache, TLB, etc.
(BTW: all times here are in microseconds.)
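
For anyone who wants to try something similar, here is a minimal sketch of how such a measurement could be set up. It is not the code behind the numbers above: the `memcpy` workload, the helper names (`now_us`, `time_copy_us`, `measure_us`) and the warm-up/iteration counts are all assumptions made for illustration. It runs a few untimed passes first so that cache and TLB are warm, then averages the timed passes, with the buffer size given in DWORDs.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Monotonic timestamp in microseconds. */
static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Hypothetical workload (an assumption, not the original one):
 * copy n_dwords uint32_t values and return the elapsed time in us. */
static double time_copy_us(uint32_t *dst, const uint32_t *src, size_t n_dwords)
{
    double t0 = now_us();
    memcpy(dst, src, n_dwords * sizeof(uint32_t));
    double t1 = now_us();
    return t1 - t0;
}

/* Run `warmup` untimed passes, then return the average duration of
 * `iters` timed passes in microseconds. Returns -1.0 on invalid
 * arguments or allocation failure. */
static double measure_us(size_t n_dwords, int warmup, int iters)
{
    if (n_dwords == 0 || iters <= 0)
        return -1.0;

    uint32_t *src = malloc(n_dwords * sizeof(uint32_t));
    uint32_t *dst = malloc(n_dwords * sizeof(uint32_t));
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    for (size_t i = 0; i < n_dwords; i++)
        src[i] = (uint32_t)i;

    /* Warm-up passes: results are discarded on purpose. */
    for (int i = 0; i < warmup; i++)
        (void)time_copy_us(dst, src, n_dwords);

    double total = 0.0;
    for (int i = 0; i < iters; i++)
        total += time_copy_us(dst, src, n_dwords);

    /* Read the destination so the copies are not optimized away. */
    volatile uint32_t sink = dst[n_dwords - 1];
    (void)sink;

    free(src);
    free(dst);
    return total / iters;
}
```

Calling `measure_us(n, 0, 1)` and `measure_us(n, 100, 1000)` for the same `n` gives a rough feel for the cold vs. warm difference. Sweeping `n` over a range of DWORD counts and plotting the result is one way to look for plateaus. For copies this small, the timer overhead can be a large share of the measured time, so treat the absolute values with care.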