“The trace monoid or free partially commutative monoid is a monoid of traces.”
@snacks i have no idea if there are still desktop cpus with non-trace instruction caches
@snacks @yew zen 5 has clustered decoders (AMD did something similar with steamroller) and two fetch paths for them as well. as long as everything fits into the instruction cache, excavator actually gets higher throughput than zen 5 on 8-byte instructions. it also benefits from SMT, but not as much as zen 5, where throughput doubles with 2 threads.
4-wide instruction decoder too slow? frontend working too hard? literally just add a second cluster lol
having two decode clusters also means zen 5 doesn’t rely quite as heavily on its op cache as zen 4 did, at least for SMT workloads. switching between instruction cache mode and op cache mode still hurts performance a lot, but usually that’s not a problem because the op cache is massive (6k entries iirc). there are cases where the op cache nearly doubles execution throughput