2007/06/09

AMD quad-core Barcelona laid bare

MPF 2006: No part untouched

By Charlie Demerjian: Wednesday 11 October 2006, 17:22
AMD LAID OUT a bit more about Barcelona today at MPF, focusing in on six areas. They talked about SSE128, enhanced IPC, efficient memory use, better caches, virtualization and power management.

In the macro sense of things, Intel is chasing AMD, but on the micro level, the opposite is true. The talk was given by Ben Sander of AMD; you can read it on his nametag if you don't believe me.

[Ben Sander of AMD]

Barcelona is the first native quad-core AMD CPU, commonly but wrongly referred to as K8L. It is the first real new core from the chipmaker in quite a while, not merely a massaged version of what came before. You should be seeing it in Q2/07, although some whispers are now saying Q3.

Other than four cores, the most obvious difference is the new widened SSE instructions. On the pre-Barcelona parts, SSE was done in 64-bit chunks, so if you wanted to do a 128b operation, you needed two passes, possibly more. Widening SSE to 128 bits should immediately double throughput on SSE instructions. Obviously media operations will benefit, but HPC and FP heavy ops will get a solid kick in the pants too.
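The arithmetic behind the "double throughput" claim can be sketched in a few lines. This is a toy model only: the function names and the one-pass-per-chunk assumption are ours for illustration, not a description of AMD's actual pipeline.

```python
def passes_needed(operand_bits: int, datapath_bits: int) -> int:
    """Passes through the execution unit for one operation.

    Assumption (ours): the operand is split into datapath-sized chunks
    and each chunk costs exactly one pass.
    """
    return -(-operand_bits // datapath_bits)  # ceiling division


def relative_throughput(operand_bits: int, old_bits: int, new_bits: int) -> float:
    """How much faster the new datapath retires the same operation."""
    return passes_needed(operand_bits, old_bits) / passes_needed(operand_bits, new_bits)
```

Under those assumptions, `passes_needed(128, 64)` is 2 and `passes_needed(128, 128)` is 1, which is where the factor of two comes from. Real code rarely sees the full gain, since fetch, loads and scheduling also have to keep up.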

In addition to the obvious width change, several less noticeable changes were made to support it. Instruction fetch was upped from 16B/cycle to 32B, support for unaligned load ops was added, and cache bandwidth was doubled to support this. Last but not least, the FP scheduler was widened from 64 to 128b.

The enhanced IPC is more a case of little improvements adding up to a bigger bang rather than any single thing standing out. The SSE improvements add a bit here and there to start with, and a better branch predictor adds to this. The predictor is bigger and better, with a larger history, a dedicated 512-entry indirect predictor, and a return stack doubled in size.

On the catching-up-with-Intel side of things, AMD added a sideband stack optimizer, out-of-order load execution and data-dependent divide latency. They also upped the TLB to support 1G pages and 48-bit physical addressing, and improved the ITLB and DTLBs. There are also more fastpath instructions, a few more bit manipulation instructions and some SSE extensions.
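To put the 1G pages and 48-bit addressing in perspective, here is a small back-of-the-envelope sketch. The helper names are ours, and the 4K baseline page size is the standard x86 small page rather than anything from the talk. The talk gave no TLB entry counts, so `tlb_reach` takes the count as a parameter.

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with the given number of address bits."""
    return 2 ** address_bits


def small_pages_per_large(large_page: int = GIB, small_page: int = 4 * KIB) -> int:
    """How many small pages a single large page replaces."""
    return large_page // small_page


def tlb_reach(entries: int, page_size: int) -> int:
    """Memory covered by a TLB holding `entries` translations (entry count is a free parameter)."""
    return entries * page_size
```

`addressable_bytes(48) // TIB` gives 256, so 48 bits covers 256 TiB. `small_pages_per_large()` gives 262144, meaning one 1G TLB entry maps what would otherwise need 262,144 separate 4K entries.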

For RAM efficiency, one of the main things they did was make the two memory controllers on the chip act independently; up to Rev G they could only act in lockstep. This lets AMD hit two memory locations at once, potentially a big win for server type apps, but for single users, its benefit is less clear.
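A toy model shows why independent controllers help some workloads more than others. Everything here is an assumption for illustration: the 64-byte line size, the alternating line-to-channel mapping, and the one-request-per-cycle cost are ours, and the model ignores bus width and burst length entirely.

```python
from collections import Counter

LINE_BYTES = 64  # assumed cache-line size, for illustration only


def channel_of(addr: int) -> int:
    """Hypothetical mapping: alternate 64-byte lines between two channels."""
    return (addr // LINE_BYTES) % 2


def cycles_lockstep(addrs) -> int:
    """Both controllers work together, so requests are served one at a time."""
    return len(addrs)


def cycles_independent(addrs) -> int:
    """Each controller drains its own queue; total time is the longer queue."""
    per_channel = Counter(channel_of(a) for a in addrs)
    return max(per_channel.values(), default=0)
```

With requests spread over both channels, such as `[0, 64, 128, 192]`, the independent model finishes in 2 cycles against 4 in lockstep. With `[0, 128, 256]` every request lands on the same channel and there is no gain, which is the single-user case in a nutshell.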

On top of this, they changed the northbridge a lot, increasing buffers and adding support for new DRAM types. The Barcelona controller will do FBD if necessary, but the chances of you seeing that are something less than zero. AMD also updated the way paging is done and modified the way write bursting happens.

Additionally, one of the big complaints I hear is prefetch, and that has been comprehensively addressed. They now track positive, negative and non-unit strides, and have a dedicated prefetch buffer. On top of that, Barcelona is much more aggressive in how it fills idle RAM cycles, and AMD updated the core prefetchers on both the L1D and L1I side.
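For readers wondering what tracking positive, negative and non-unit strides means in practice, here is a minimal sketch of a stride detector. It is a generic textbook-style scheme, not AMD's design: the class name and the rule of predicting after seeing the same stride twice in a row are our assumptions.

```python
class StridePrefetcher:
    """Toy stride detector working on cache-line numbers.

    Assumption (ours): a prefetch is issued once the same non-zero
    stride has been observed on two consecutive accesses.
    """

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def access(self, addr: int):
        """Record an access; return the line to prefetch, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride  # works for any sign or size of stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction
```

Feeding it lines 0, 1, 2 yields a prefetch of line 3 on the third access. The sequence 10, 9, 8 yields 7 (negative stride), and 0, 4, 8 yields 12 (non-unit stride).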

The newer L3 cache is called a 'victim cache'; it sits on top of the existing discrete L2 caches and is shared among the cores. The big thing here is that if the caches are empty, the requested line goes to L1 cache. When L1 fills, the data line is evicted to L2, and when L2 fills, it goes to L3. If a core reads from L3, one of two things can happen. If it is data, the line is moved to L1D, but if it is instructions, the line may be moved to L1I, or it may be copied to L1I and left in L3. The L3 is not exclusive, and the point of this is that code is often shared among the cores while data is far less often. The cache lines have performance hints associated with them that will clue the cores in on the whole copy vs move debate.
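The fill, evict and copy-vs-move flow described above can be sketched as a small simulation. This is a simplification under stated assumptions: one core, a single L1 instead of split L1D/L1I, tiny capacities, oldest-first eviction, and a `shared` flag standing in for the performance hints. None of those details come from AMD.

```python
from collections import OrderedDict


class VictimHierarchy:
    """Toy L1 -> L2 -> L3 victim chain (sizes and policy are assumptions)."""

    def __init__(self, l1_size=2, l2_size=2, l3_size=4):
        self.sizes = [l1_size, l2_size, l3_size]
        self.levels = [OrderedDict(), OrderedDict(), OrderedDict()]  # L1, L2, L3

    def _insert(self, level, line, kind):
        if level >= len(self.levels):
            return  # evicted from L3: the line leaves the hierarchy
        cache = self.levels[level]
        cache[line] = kind
        cache.move_to_end(line)
        if len(cache) > self.sizes[level]:
            victim, victim_kind = cache.popitem(last=False)  # oldest line out
            self._insert(level + 1, victim, victim_kind)  # victim drops one level

    def read(self, line, kind="data", shared=False):
        """Return where the line was found: 'L1', 'L2', 'L3' or 'memory'."""
        for i, cache in enumerate(self.levels):
            if line in cache:
                if i == 0:
                    return "L1"
                # Shared code hit in L3 is copied; everything else is moved.
                keep_copy = i == 2 and kind == "code" and shared
                if not keep_copy:
                    del cache[line]
                self._insert(0, line, kind)
                return f"L{i + 1}"
        self._insert(0, line, kind)  # miss everywhere: fill straight into L1
        return "memory"
```

With one-line L1 and L2 caches, reading A, B, C and then A again finds A in L3 and moves it back to L1, leaving no copy in L3. Doing the same with `kind="code", shared=True` leaves the line in both L1 and L3, so another core could pick it up without going to memory.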

That brings us to virtualization, a topic we have covered a lot in our Pacifica articles (1, 2, 3 and 4). The main advance Barcelona brings is that it turns on Nested Page Table support and a few other things that did not make the cut on the Rev F parts. It is also said to reduce world switch time going to the hypervisor and back by 25%.

One of the last things they did was to break the CPU and northbridge into separate power planes. This will allow the CPUs to be clocked up and down, and volted up and down, independently of the northbridge. This was a major sticking point in the ramp of the previous iterations of the chip, and I expect big dividends here. It also saves a lot of power. I am also told that they can change the voltage on cores independently, but that is more of a motherboard issue. Since it is not really supported anywhere on the current platforms, don't look for a BIOS option to turn it on, but if it came back in later platform revisions, I would not be overly surprised.

What you have for Barcelona is a CPU that looks a lot like the older revs at the block diagram level, but no part of it is untouched. Some pieces are massively updated, others far less so. In any case, it will add up to significant gains overall, but will it be enough to dethrone the dual-core Woodcrest? Stay tuned in the middle of next year for the answer to that.
