Thursday, November 5, 2009

Can Nvidia's Wile E. Coyote Kill off IBM's Roadrunner?

The Roadrunner is an IBM BladeCenter QS22 cluster consisting of 12,960 IBM PowerXCell 8i processors and 6,480 dual-core AMD Opteron processors. In November 2008 the Roadrunner was clocked at 1.465 PFLOPS (that’s pretty damn fast for a tiny bird). In fact it’s so fast that it is currently the fastest supercomputer on the planet according to http://www.top500.org . But how long will that last?

Well, according to my calculations it ain’t gonna be long. I recently evaluated IBM’s BladeCenter QS22. I didn’t have a cluster containing 1,080 of them connected with high-speed InfiniBand switches like the Roadrunner does… I only had one. I ran a Monte Carlo option model on it and compared it to similar option models running on both a 2.8GHz AMD Opteron (baseline) and an Nvidia Tesla C1060.

I chose a Monte Carlo model for European options because it is a widely understood model that requires significant raw computational horsepower. Additionally, each of the vendors I evaluated provides a version of this model tuned for its own platform. Comparing code that is tuned for each platform should ensure the best possible runtime for the model on that platform. I modified the tuned code slightly to ensure that it was functionally equivalent across all platforms. For comparison purposes I timed how long it took to perform a valuation of a single option with 1,000,000 random paths. I did not include any time spent in one-time device initialization calls in the overall runtime of the valuation.
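To make the measurement concrete, here is a minimal Python sketch of that kind of valuation. It is not the vendors' tuned code. It assumes a European call priced under geometric Brownian motion with a single terminal draw per path, and the option parameters (spot 100, strike 100, 5% rate, 20% volatility, one year), the seed, and the function names are all made up for illustration. The random generator is built outside the timed region, in the same spirit as leaving one-time device initialization out of the runtime.

```python
import math
import random
import time


def mc_european_call(s0, k, r, sigma, t, n_paths, rng):
    """Monte Carlo price of a European call (assumed GBM, one draw per path)."""
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)                # standard normal draw
        s_t = s0 * math.exp(drift + vol * z)   # terminal asset price
        payoff_sum += max(s_t - k, 0.0)        # call payoff at expiry
    # Discount the average payoff back to today.
    return math.exp(-r * t) * payoff_sum / n_paths


def timed_valuation(n_paths=1_000_000, seed=42):
    """Value one option and time only the valuation itself."""
    rng = random.Random(seed)  # one-time setup, kept outside the timed region
    start = time.perf_counter()
    # Hypothetical option parameters, chosen only for illustration.
    price = mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, n_paths, rng)
    elapsed = time.perf_counter() - start
    return price, elapsed


if __name__ == "__main__":
    price, elapsed = timed_valuation()
    print(f"price={price:.4f} elapsed={elapsed:.4f}s")
```

Pure Python will be far slower than any of the tuned versions; the point is only to show what one valuation of 1,000,000 paths involves and where the stopwatch starts and stops.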

Before I tested the new platforms I established a CPU-based baseline. The baseline test was run on a dual-core 2.8GHz AMD Opteron server with 8GB of memory. For purposes of comparison I will be comparing a single-threaded CPU version of the model to the Cell and Tesla versions. The CPU version of the model took 1.1321 seconds to complete a single valuation. The model took 0.1260 seconds to run on the QS22 blade. On the Nvidia Tesla C1060 it took only 0.00187 seconds to run a single valuation.
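The speedups implied by those three timings are just ratios, and a few lines of Python make them easy to check. The dictionary keys and the `speedup` helper are names picked for this sketch; the numbers are the timings quoted above.

```python
# Seconds per single valuation, as measured above.
timings = {
    "Opteron": 1.1321,
    "QS22": 0.1260,
    "Tesla C1060": 0.00187,
}


def speedup(slow, fast):
    """How many times faster `fast` is than `slow`."""
    return timings[slow] / timings[fast]


if __name__ == "__main__":
    for slow, fast in [("Opteron", "QS22"),
                       ("Opteron", "Tesla C1060"),
                       ("QS22", "Tesla C1060")]:
        print(f"{fast} vs {slow}: {speedup(slow, fast):.1f}X")
```

From these figures the QS22 comes out roughly 9X faster than the Opteron, and the Tesla roughly 605X faster than the Opteron and 67X faster than the QS22.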

The Tesla C1060 was about 605X faster than the Opteron and 67X faster than the Cell processor, going by the timings above. To be fair, Monte Carlo is a data-parallel algorithm, which is exactly what GPUs excel at. The Cell processor can also handle task-parallel algorithms, where the GPU currently does not excel.
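Why is Monte Carlo data parallel? Every path is simulated independently, so the paths can be split into chunks that never talk to each other and only get combined at the end. A rough Python sketch of that structure follows. It again assumes a European call under geometric Brownian motion with made-up parameters, and the function names and per-chunk seeds are hypothetical (seeding each chunk differently is good enough for a sketch, not a rigorous way to get independent random streams).

```python
import math
import random


def chunk_payoff_sum(task):
    """Sum of call payoffs for one independent chunk of paths."""
    seed, n_paths, s0, k, r, sigma, t = task
    rng = random.Random(seed)  # each chunk owns its own generator
    drift = (r - 0.5 * sigma * sigma) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):
        s_t = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(s_t - k, 0.0)
    return total


def chunk_sizes(n_paths, n_chunks):
    """Split n_paths into n_chunks sizes that differ by at most one."""
    base, extra = divmod(n_paths, n_chunks)
    return [base + (1 if i < extra else 0) for i in range(n_chunks)]


def price_in_chunks(n_paths, n_chunks, s0=100.0, k=100.0,
                    r=0.05, sigma=0.2, t=1.0):
    """Price the option by combining independent per-chunk payoff sums."""
    tasks = [(1000 + i, size, s0, k, r, sigma, t)
             for i, size in enumerate(chunk_sizes(n_paths, n_chunks))]
    # Plain map runs the chunks one after another. No chunk reads another
    # chunk's state, so a parallel map could take its place.
    total = sum(map(chunk_payoff_sum, tasks))
    return math.exp(-r * t) * total / n_paths


if __name__ == "__main__":
    print(f"{price_in_chunks(1_000_000, 8):.4f}")
```

On a GPU the "chunks" shrink down to one or a few paths per thread, but the shape is the same: lots of identical, independent work followed by one reduction.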

Why did I say “currently” in the last sentence? Because Nvidia is readying the release of its next-generation chip, codenamed Fermi. The Fermi chip will have double the number of processing cores of the chip on the C1060 and three times the amount of shared memory. It also supports significantly faster atomic instructions, which should let the card run task-parallel algorithms significantly faster.

So why did IBM build a supercomputer out of something so blatantly inferior? My guess is that it was a last-ditch effort to spotlight the QS22 as a viable product in the HPC marketplace before the GPU vendors take over that market segment. Did IBM succeed…? Nope!

Well, they do currently occupy the top spot on the Top500 list, but the folks at Oak Ridge National Laboratory have announced plans to build their next supercomputer based on clusters of Nvidia Fermi GPUs. They claim that their system will be 10X more powerful than IBM’s Roadrunner. So there you have it… While IBM’s Roadrunner is the top dog for now… its reign will be short-lived.