Hi Agner,
I'd like to know your opinion about a few things I was thinking about regarding BD's architecture:
* What do you think of BD's AGU not being able to issue LS-related instructions like mov r/m? I.e., K10 could issue memory instructions in the AGU, whereas BD cannot, and PD is the same (only mov r/r and the like take the AGU path for renaming, I think). According to the BD manual, almost no instruction ends up in the AGU alone (unlike K10). It seems to me they moved toward a fixed maximum throughput of 2 instructions per core, a huge step down from the previous (ideal) 6. Considering that MOVs are everywhere and decode fast, this seems to me a huge limit on overall IPC.
* The split L2 cache access - do you think they'd do better using a contention mechanism for the whole cache, instead of splitting its access in half?
* Do you think AMD will add a trace cache to fix the bad dual-core decoder throughput, like Intel did? I can't figure out a fix for that (decoding 6 instructions would not work, and making two x-1/x-1 decoders would double the first instr. decoder).
* What do you think about the choice of a WT L1D with higher latency (coupled with a WCC halfway to the L2)? In your experience, does it impact speed much?

On a last note: I was thinking of BD's IPC - 2 ALU + 2 ALU (+ 2 FPU, but they share the LS with the ALUs...). SB could sustain 4 instr/cycle in loops thanks to the TC, but the BD decoder would likely cut the IPC to 1.x/core, no? Is the shared decoder the bigger bottleneck for BD, or the reworked AGU?
Do you think that if AMD reworks the front end to get near 2 instr/cycle/core, it will still fall short without the ability to parallelize MOVs?

Thanks,
Massimo
Hi Agner,
I've seen that you updated the instruction table, and it seems different from AMD's! So MOV r/m is issued in the AGU... but mov m/r is not?