Conversation

“The trace monoid or free partially commutative monoid is a monoid of traces.”

@yew you're falling for the fp meme?
@yew so basically a stream of instructions where blocks can be reordered? Kinda like some sort of compression scheme for loops with branches ig but sounds hard to implement

@snacks decoded instructions (micro operations)

@yew that's pretty standard for an instruction cache afaik
@yew the article you linked says in its first sentence that it's a specialized instruction cache

@snacks i have no idea if there’s still desktop cpus with non-trace instruction caches

@yew it's a way to help with branch prediction issues so it prob doesn't matter that it's complex to implement yeah, makes sense
@yew or, anything high powered not massively parallel
@yew apparently at least zen uses a regular ass uop cache and does idk what instead. Intel has a uop queue that does loop unrolling internally?
@yew amd somehow manages it in their data caches?
@yew they store some extra data on how everything can branch in their caches and then just decode all the branches because the decoder is too stronk?
@yew this seems wrong, surely i'm misunderstanding something
@yew found a nice pdf with some info on all the modern microarchitectures btw: https://www.agner.org/optimize/microarchitecture.pdf
@yew at least zen 5, quoting the pdf:
“The Zen 5 breaks this long-standing bottleneck for the first time with a fetch rate of 32 bytes per clock and a decoding rate of 6 instructions per clock. This fetch and decode rate applies to each side of a 2-way branch when the two branches are decoded simultaneously.”
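To put the quoted numbers side by side, here is a toy back-of-the-envelope model of one fetch/decode path. It is only a sketch: the function name `frontend_ipc` is made up for illustration, the instruction lengths are hypothetical inputs, and the model ignores alignment, the op cache, branches and everything else a real frontend does.

```python
def frontend_ipc(fetch_bytes_per_clock, decode_per_clock, avg_instr_bytes):
    """Upper bound on instructions per clock for a single fetch/decode path.

    Toy model (assumption): throughput is capped by whichever is lower,
    the decoder width or the number of whole-average-length instructions
    that fit in the bytes fetched each clock.
    """
    fetch_limit = fetch_bytes_per_clock / avg_instr_bytes
    return min(decode_per_clock, fetch_limit)


# Numbers from the quote: 32 bytes/clock fetch, 6 instructions/clock decode.
# The 4-byte and 8-byte instruction lengths are illustrative assumptions.
short_instrs = frontend_ipc(32, 6, 4)  # decode-bound
long_instrs = frontend_ipc(32, 6, 8)   # fetch-bound
print(short_instrs, long_instrs)
```

Under these assumptions, long instructions make the path fetch-bound well before the decoder width matters, which is why fetch bandwidth keeps coming up as the bottleneck.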
@yew a free monoid is basically lists (or strings) under concatenation, and traces add an independence relation that lets adjacent independent letters of your string be swapped. This represents concurrent operations or something idk
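A rough Python sketch of that idea, in case it helps: two strings are the same trace if one can be turned into the other by swapping adjacent independent letters. One way to check this is to compute the lexicographically smallest string in each equivalence class and compare. The names `normal_form` and `equivalent` and the example alphabet are made up for illustration, and the independence relation is assumed to be a set of unordered letter pairs.

```python
def normal_form(word, independent):
    """Return the lexicographically smallest word in the same trace.

    `independent` is a set of frozensets {a, b} of letters that commute.
    A letter may move to the front only if every letter before it is
    independent of it; we greedily pick the smallest such letter.
    """
    rest = list(word)
    out = []
    while rest:
        candidates = []
        for i, c in enumerate(rest):
            # A letter is never independent of itself, so only the first
            # occurrence of each letter can qualify here.
            if all(frozenset((c, p)) in independent for p in rest[:i]):
                candidates.append((c, i))
        c, i = min(candidates)  # index 0 always qualifies, so never empty
        out.append(c)
        del rest[i]
    return "".join(out)


def equivalent(u, v, independent):
    """Two words denote the same trace iff their normal forms match."""
    return normal_form(u, independent) == normal_form(v, independent)


# Hypothetical example: 'a' and 'b' commute, 'c' depends on both.
independent = {frozenset("ab")}
print(equivalent("bac", "abc", independent))  # a and b can swap
print(equivalent("acb", "abc", independent))  # b cannot move past c
```

Read the letters as operations and "independent" as "no dependency between them", and this is the reordering-of-instructions picture from earlier in the thread.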

@snacks @yew zen has clustered decoders (that’s something AMD did with steamroller as well) and zen 5 also has two fetch paths for them. as long as everything fits into the instruction cache, for 8-byte instructions, excavator actually gets higher throughput than zen 5 and also benefits from SMT, but not as much as zen 5, where throughput doubles with 2 threads.

4-wide instruction decoder too slow? frontend working too hard? literally just add a second cluster lol

having two decode clusters also means zen 5 doesn’t rely quite as heavily on its op cache as zen 4 did, at least for SMT workloads, but switching between instruction cache and op cache mode still hurts performance a lot. usually that’s not a problem because the op cache is massive (6k entries iirc). there are cases where the op cache nearly doubles execution throughput

@mia @yew i knew that zen 5 was better with smt than most other architectures but reading up on this stuff is kinda crazy. I thought it was mostly just wider