Problems with the Architecture
At the heart of both the Xenon and Cell processors is IBM's custom PowerPC-based core. We've discussed this core in our previous articles, but it is best characterized as being quite simple. The core itself is a very narrow 2-issue in-order execution core, featuring a 64KB L1 cache (32K instruction/32K data) and either a 1MB or 512KB L2 cache (for Xenon or Cell, respectively). Supporting SMT, the core can execute two threads simultaneously, similar to a Hyper-Threading enabled Pentium 4. The Xenon CPU is made up of three of these cores, while Cell features just one.
Each individual core is extremely small, making the 3-core Xenon CPU in the Xbox 360 smaller than a single-core 90nm Pentium 4. While we don't have exact die sizes, we've heard that the number is around 1/2 the size of the 90nm Prescott die.
Cell's PPE is identical to a single core in Xenon. The die area of the Cell processor is 221 mm^2; note how little space is occupied by the PPE - it is a very simple core.
IBM's pitch to MS was based on the peak theoretical floating point performance-per-dollar that the Xenon CPU would offer, and given MS's focus on cost savings with the Xbox 360, they took the bait.
While MS and Sony have been childishly playing this flops-war, comparing the 1 TFLOPs figure quoted for the Xbox 360 to the 2 TFLOPs figure quoted for the PlayStation 3, the real-world performance war has already been lost.
Right now, from what we've heard, the real-world performance of the Xenon CPU is about twice that of the 733MHz processor in the first Xbox. Considering that this CPU is supposed to power the Xbox 360 for the next 4 - 5 years, it's nothing short of disappointing. To put it in perspective, floating point multiplies are apparently 1/3 as fast on Xenon as on a Pentium 4.
The reason for the poor performance? The very narrow 2-issue in-order core also happens to be very deeply pipelined, apparently with a branch predictor that's not the best in the business. In the end, you get what you pay for, and with such a small core, it's no surprise that performance isn't anywhere near the Athlon 64 or Pentium 4 class.
The Cell processor doesn't get off the hook just because it only uses a single one of these horribly slow cores; the SPE array ends up being fairly useless in the majority of situations, making it little more than a waste of die space.
We mentioned before that collision detection can be accelerated on the SPEs of Cell, despite being fairly branch heavy. The lack of a branch predictor in the SPEs apparently isn't that big of a deal, since most collision detection branches are basically random and can't be predicted even with the best branch predictor. So not having a branch predictor doesn't hurt; what does hurt, however, is the very small amount of local memory available to each SPE. In order to access main memory, the SPE places a DMA request on the bus (or the PPE can initiate the DMA request) and waits for it to be fulfilled. According to those who have had experience with the PS3 development kits, this access takes far too long to be used in many real-world scenarios. It is the small amount of local memory that each SPE has access to that limits the SPEs to working on no more than a handful of tasks. While physics acceleration is an important one, there are many more tasks that can't be accelerated by the SPEs because of the memory limitation.
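To make the local memory problem concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not a description of how any real SPE code is scheduled: the function name `estimate_spe_job`, the DMA latency, the working set size and the per-byte compute cost are all made-up assumptions, and the 256 KB entry is the commonly cited SPE local store size rather than a figure from our sources. The model also assumes the SPE sits idle while a transfer is in flight, which is the pessimistic case.

```python
from typing import NamedTuple


class SpeJobEstimate(NamedTuple):
    chunks: int        # number of DMA transfers needed to stream the data
    wait_cycles: int   # cycles spent waiting on those transfers
    total_cycles: int  # wait + compute, assuming the two never overlap


def estimate_spe_job(working_set_bytes: int,
                     local_store_bytes: int,
                     dma_latency_cycles: int,
                     compute_cycles_per_byte: int) -> SpeJobEstimate:
    """Toy model: stream a working set through a fixed-size local store."""
    if local_store_bytes <= 0:
        raise ValueError("local_store_bytes must be positive")
    # Integer ceiling division: every partial chunk still costs a full transfer.
    chunks = (working_set_bytes + local_store_bytes - 1) // local_store_bytes
    wait_cycles = chunks * dma_latency_cycles
    compute_cycles = working_set_bytes * compute_cycles_per_byte
    return SpeJobEstimate(chunks, wait_cycles, wait_cycles + compute_cycles)


if __name__ == "__main__":
    # Every number below is an illustrative assumption, not a measurement.
    working_set = 4 * 1024 * 1024
    for store_kb in (64, 256, 1024):
        est = estimate_spe_job(working_set, store_kb * 1024, 1000, 1)
        share = est.wait_cycles / est.total_cycles
        print(f"{store_kb:>5} KB local store: {est.chunks} transfers, "
              f"{share:.1%} of cycles spent waiting")
```

The useful takeaway is the shape of the result rather than the numbers: the smaller the local store, the more round trips to main memory a given job needs, and each one carries a fixed latency. Real code can hide some of that by overlapping transfers with compute, which this sketch deliberately ignores.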
The other point that has been made is that even if you can offload some of the physics calculations to the SPE array, the Cell's PPE ends up being a pretty big bottleneck thanks to its overall lackluster performance. It's akin to having an extremely fast GPU but without a fast CPU to pair it up with.
-------------------------------------------------
What About Multithreading?
We of course asked the obvious question: would game developers rather have 3 slow general purpose cores, or one of those cores paired with an array of specialized SPEs? The response was unanimous: everyone we have spoken to would rather take the general purpose core approach.
Citing everything from ease of programming to the limitations of the SPEs we mentioned previously, the cross-platform developers we've spoken to consider the Xbox 360 the more developer-friendly of the two platforms. Despite being more developer-friendly, the Xenon CPU is still not what developers wanted.
The most ironic bit of it all is that according to developers, if either manufacturer had decided to use an Athlon 64 or a Pentium D in their next-gen console, they would be significantly ahead of the competition in terms of CPU performance.
While the developers we've spoken to agree that heavily multithreaded game engines are the future, that future won't really take form for another 3 - 5 years. Even MS admitted to us that all developers are focusing on having, at most, one or two threads of execution for the game engine itself - not the four or six threads that the Xbox 360 was designed for.
Even when games become more aggressive with their multithreading, targeting 2 - 4 threads, most of the work will still be done in a single thread. It won't be until the next step in multithreaded architectures where that single thread gets broken down even further, and by that time we'll be talking about Xbox 720 and PlayStation 4. In the end, the more multithreaded nature of these new console CPUs doesn't help paint much of a brighter performance picture - multithreaded or not, game developers are not pleased with the performance of these CPUs.
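The ceiling described here is essentially Amdahl's law: if most of the work stays in one thread, extra hardware threads buy very little. A minimal sketch in Python follows; the parallel fractions are made-up illustrative values, not figures from any developer we spoke to, and the helper name `amdahl_speedup` is ours.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the work scales with threads."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at 1x; the parallel part is divided across the threads.
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # Illustrative assumptions only: how much of a frame can be spread across threads.
    for fraction in (0.2, 0.5, 0.9):
        for threads in (2, 4, 6):
            speedup = amdahl_speedup(fraction, threads)
            print(f"parallel={fraction:.0%} threads={threads}: {speedup:.2f}x")
```

Even this is optimistic for these consoles, since it treats every hardware thread as a full core; two SMT threads sharing one in-order core will not deliver twice the throughput of one.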
What About All Those Flops?
The one statement that we heard over and over again was that MS was sold on the peak theoretical performance of the Xenon CPU. Ever since the announcement of the Xbox 360 and PS3 hardware, people have been set on comparing MS's figure of 1 trillion floating point operations per second to Sony's figure of 2 trillion floating point operations per second (TFLOPs). Any AnandTech reader should know for a fact that these numbers are meaningless, but just in case you need some reasoning for why, let's look at the facts.
First and foremost, a floating point operation can be anything; it can be adding two floating point numbers together, it can be performing a dot product on two floating point vectors, or it can even be just calculating the complement of a floating point number. Anything that is executed on an FPU is fair game to be called a floating point operation.
Secondly, both floating point power numbers refer to the whole system, CPU and GPU. Obviously a GPU's floating point processing power doesn't mean anything if you're trying to run general purpose code on it and vice versa. As we've seen from the graphics market, characterizing GPU performance in terms of generic floating point operations per second is far from the full performance story.
Third, when a manufacturer is talking about peak floating point performance there are a few things that they aren't taking into account. Being able to process billions of operations per second depends on actually being able to have that many floating point operations to work on. That means that you have to have enough bandwidth to keep the FPUs fed, no mispredicted branches, no cache misses and the right structure of code to make sure that all of the FPUs can be fed at all times so they can execute at their peak rates. We already know that's not the case as game developers have already told us that the Xenon CPU isn't even in the same realm of performance as the Pentium 4 or Athlon 64. Not to mention that the requirements for hitting peak theoretical performance are always ridiculous; caches are only so big and thus there will come a time where a request to main memory is needed, and you can expect that request to be fulfilled in a few hundred clock cycles, where no floating point operations will be happening at all.
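A quick way to see how far memory stalls alone can pull sustained throughput below the peak number is sketched below in Python. This is a deliberately crude model under stated assumptions: the core does nothing useful while a miss is outstanding, and the flops-per-cycle rate, miss count and 300-cycle penalty are illustrative values we picked (the penalty is in the "few hundred cycles" range mentioned above), not measured Xenon or Cell figures. The function name `sustained_fraction` is ours.

```python
def sustained_fraction(flops: int,
                       flops_per_cycle: float,
                       misses: int,
                       miss_penalty_cycles: int) -> float:
    """Fraction of peak throughput achieved when every miss stalls the core."""
    if flops <= 0 or flops_per_cycle <= 0:
        raise ValueError("flops and flops_per_cycle must be positive")
    if misses < 0 or miss_penalty_cycles < 0:
        raise ValueError("misses and miss_penalty_cycles must be non-negative")
    compute_cycles = flops / flops_per_cycle      # cycles needed at the peak rate
    stall_cycles = misses * miss_penalty_cycles   # cycles where no flops retire
    return compute_cycles / (compute_cycles + stall_cycles)


if __name__ == "__main__":
    # Illustrative assumptions: 8 flops/cycle peak, 300-cycle trip to main memory.
    for flops_per_miss in (10_000, 1_000, 100):
        misses = 1_000_000 // flops_per_miss
        share = sustained_fraction(1_000_000, 8.0, misses, 300)
        print(f"one miss per {flops_per_miss} flops: {share:.1%} of peak")
```

Mispredicted branches and bandwidth limits would push the figure down further; the sketch only accounts for main memory latency.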
So while there may be some extreme cases where the Xenon CPU can hit its peak performance, it sure isn't happening in any real world code.
The Cell processor is no different; given that its PPE is identical to one of the PowerPC cores in Xenon, it must derive its floating point performance superiority from its array of SPEs. So what's the issue with the 218 GFLOPs number (2 TFLOPs for the whole system)? Well, from what we've heard, game developers are finding that they can't use the SPEs for a lot of tasks. So in the end, it doesn't matter what the peak theoretical performance of Cell's SPE array is if those SPEs aren't being used all the time.
Don't stare directly at the flops; you may start believing that they matter.