Thursday, July 15, 2010

EM Photonics Supports Fermi

EM Photonics, Inc. announced today the general availability of CULA 2.0, its GPU-accelerated linear algebra library used by thousands of developers and scientists worldwide. The new version provides support for NVIDIA GPUs based on the latest "Fermi" architecture, which offers HPC users unprecedented double-precision performance, faster memory, and new usability features.

"The Tesla 20-series GPUs deliver a huge increase in double precision performance," said Andy Keane, General Manager for the Tesla high-performance computing group at NVIDIA. "The LAPACK functionality provided by CULA is critical to many applications ranging from computer-aided engineering and medical image reconstruction to climate change models, financial analysis and more. This new release is great news for developers who can easily accelerate their application with CULA 2.0.," he added.

"CULA 2.0 is the next step in the evolution of our product, where we can finally show strong double precision performance to complement our already impressive single precision speeds. Users of older GPUs will also see performance improvements as well as new routines and increased accuracy. As we continue tuning our CULA library for Fermi, users can expect to see even better performance as well as new features in the next few months," said Eric Kelmelis, CEO of EM Photonics.

Product Features

CULA provides a LAPACK interface comprising over 150 mathematical routines from LAPACK, the industry standard for computational linear algebra. EM Photonics' CULA library includes many popular routine families: system solvers, least squares solvers, orthogonal factorizations, eigenvalue routines, and singular value decompositions.

CULA offers performance up to an order of magnitude faster than highly optimized CPU-based linear algebra solvers. A variety of interfaces is available to integrate directly into existing code: programmers can call GPU-accelerated CULA from C/C++, FORTRAN, MATLAB, or Python, all with no GPU programming experience required. CULA is available for any system equipped with a GPU based on the NVIDIA CUDA architecture, including 32- and 64-bit versions of Linux, Windows, and OS X.
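To give a sense of how little GPU-specific work is involved, here is a minimal sketch of solving a dense linear system through CULA's LAPACK-style C interface. The routine names (culaInitialize, culaSgesv, culaShutdown) follow the LAPACK-derived naming convention described in CULA's documentation, but the header name and exact signatures shown here are assumptions for illustration rather than text copied from the shipping API:

/* Minimal sketch: solve A*x = b through a CULA-style LAPACK interface.
 * Header name and exact signatures are assumed for illustration;
 * consult the CULA Reference Manual for the shipping API. */
#include <stdio.h>
#include <stdlib.h>
#include <cula.h>                /* assumed header name */

int main(void)
{
    const int n = 1024, nrhs = 1;
    float *a    = (float *)malloc(sizeof(float) * n * n);    /* column-major, LAPACK style */
    float *b    = (float *)malloc(sizeof(float) * n * nrhs);
    int   *ipiv = (int *)malloc(sizeof(int) * n);

    /* ... fill a and b with the system to be solved ... */

    if (culaInitialize() != culaNoError) {                    /* bind to the GPU */
        fprintf(stderr, "CULA initialization failed\n");
        return 1;
    }

    /* Single-precision general system solve; the factorization and
     * back-substitution run on the GPU, with no kernel code required
     * from the caller. */
    if (culaSgesv(n, nrhs, a, n, ipiv, b, n) != culaNoError)
        fprintf(stderr, "culaSgesv reported an error\n");

    culaShutdown();
    free(a); free(b); free(ipiv);
    return 0;
}

The call is shaped one-to-one like LAPACK's sgesv, which is what makes dropping CULA into existing CPU code straightforward.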

AccelerEyes Supports Fermi

AccelerEyes today announced that its Jacket software platform for MATLAB, including its new 1.4 release, will support the latest NVIDIA graphics processing units (GPUs) based on the Fermi architecture (Tesla 20-series and GeForce GTX 4xx-series). NVIDIA's Fermi architecture brings with it 448 computational cores, increased IEEE-754 floating-point arithmetic precision, error-correcting memory for reliable computation, and enhanced memory caching mechanisms.

AccelerEyes develops Jacket, a software platform that delivers GPU computing power to desktop users of MATLAB and other very high-level languages. It enables faster prototyping and problem solving across a range of government, manufacturing, energy, media, biomedical, financial, and scientific research applications. The Jacket platform enables accelerated double-precision performance for common arithmetic and linear algebra functionality on the new NVIDIA hardware based on the Fermi architecture.

"With this release of Jacket and with the familiar MATLAB environment, domain experts can create highly optimized heterogeneous applications for the latest CPUs from Intel and AMD while also leveraging the latest generation of GPUs from NVIDIA," said John Melonakos, CEO of AccelerEyes. "Efficiently using all available host cores for certain parts of an application while accelerating other portions on GPUs is the key to squeezing maximum performance out of today's GPU-enabled workstations, servers, clusters, and cloud services. With Fermi's improvement in double-precision performance, we expect a big increase in the number and type of applications that benefit from GPU acceleration."

Wednesday, December 16, 2009

David takes on Goliath…

… and, while the battle still rages on, David appears to be winning.

In April of 2004 Nvidia hired Ian Buck, who specialized in GPGPU research at Stanford, laying the foundation for CUDA development. Two and a half years later Nvidia launched CUDA. In June of 2008 Nvidia announced its C1060 Tesla card, intended solely for GPGPU processing. Pat Gelsinger, Senior VP and co-general manager of Intel's Digital Enterprise Group, stated in a July 2008 interview that CUDA would be little more than an "interesting footnote in the history of computing annals". Pat confidently proclaimed that Intel's Larrabee chip would easily best the GPGPU ecosystem that Nvidia had built. Larrabee would not only be an HPC powerhouse but also a better graphics card.

Fast forward to December 2009 and you will find that Intel has thrown in the towel on Larrabee as a graphics product, while the HPC future of Larrabee is still somewhat murky. Intel claims that Larrabee will still be used internally and externally by developers as an HPC platform. Pat Gelsinger, the "confident" Intel executive, now works for EMC. But Intel wasn't the only CPU juggernaut bearing down on Nvidia. In 2006 AMD purchased ATI and announced its vision of the future… Fusion.

AMD Fusion is the codename for a next-generation microprocessor design and the product of the merger between AMD and ATI, combining general-purpose processor execution with the 3D geometry processing and other functions of modern GPUs in a single package. AMD announced at its Financial Analyst Day in December 2007 that its first two Fusion products, codenamed Falcon and Swift, would be available in 2009.

Fast forward to December 2009 and you will find no mention of AMD's Falcon and Swift. However, AMD does not appear to have thrown in the towel yet. They are still working on Fusion and claim they will have something out in 2010.

In January 2009 AMD announced at CES a system called the Fusion Render Cloud. The system would be powered by Phenom II processors with "more than 1,000" ATI Radeon HD 4870 graphics cards. The Fusion Render Cloud is supposed to move the rendering of complex, interactive, three-dimensional scenes up onto internet Fusion Render Cloud servers and, using OTOY's software, stream the data to a simple browser on a connected device, bringing this kind of content to devices that haven't been able to handle it because of "device size, battery capacity, and processing power" (think cell phones and other mobile devices). The system was supposed to be ready in the second half of 2009.

Fast forward to December 2009 (last time I checked, December was the last month of the second half of the year) and still no Fusion Render Cloud. Interestingly enough… from out of the blue… Nvidia announced its RealityServer platform. The platform is a combination of Nvidia Tesla GPUs and 3D web services software that delivers interactive, photorealistic applications over the web, enabling product designers, architects, and consumers to easily visualize 3D scenes with remarkable realism. The RealityServer platform runs in a GPU-based cloud computing environment, accessible from web-connected PCs, netbooks, and smartphones, enabling 3D web applications to dynamically scale based on utilization requirements. Oh… and it was available for purchase the day it was announced.

So while David has not killed Goliath yet… he seems to be making steady progress on both the hardware and software sides, putting actual products in the hands of his users. It doesn't look like anyone has told David that he is going to be little more than an "interesting footnote in the history of computing annals". If they have… I don't think he's listening.

Thursday, December 10, 2009

State of the GPGP Union

I’ve been following the GPGPU industry for a few years now. Initially Nvidia was the only player with a supported development platform. While they provided support for GPGPU on most of their cards, they only had support for double precision on their Tesla cards. If you were interested in doing anything serious with GPGPU you used Nvidia’s CUDA SDK and a Tesla C1060 card (or S1070 1U server).

Fast forward a year or so and the GPGPU landscape has changed significantly. AMD / ATI has entered the GPGPU arena with its Stream computing and Fusion initiatives. Intel claimed that it would take over both the graphics market and the HPC market with its Larrabee initiative, and then recently started backpedaling. With all of the recent changes, I thought it would be useful to take a look at the current (and soon-to-be-released) product offerings from AMD / ATI and Nvidia.

First let's take a look at the desktop offerings. While AMD / ATI and Nvidia support GPGPU on a large number of cards, I will limit my comparison to the high-end devices. (If you're reading this through a feed aggregator you will need to hit the actual site http://gpgpu-computing3.blogspot.com/ in a browser to get the tabular data... sorry)

Desktop Cards



While at first glance it might seem as though the AMD / ATI cards offer better performance at a lower cost, you should be cautious with this assessment. Often what these vendors report is theoretical performance, not actual. Additionally, depending on what types of algorithms you are running, you might be limited more by memory than by the number of processors. Nvidia has also been specifically modifying the design of its GPUs to make them better at GPGPU tasks, while AMD / ATI has only just gotten into the game. AMD / ATI is working with SciSoft to hammer out an OpenCL GPGPU benchmark suite. Once this is completed we will be able to make comparisons intelligently.

Another important factor is product availability. Both ATI cards are currently available, though only in limited numbers due to low fabrication yields. On the Nvidia side only the C1060 cards are currently available; the C2050 and C2070 are rumored to be available in Q2 2010. Nvidia does have what it calls "The Mad Science Program", a sales initiative that lets you purchase current-generation Tesla products and upgrade them when the next-generation products are released. All you have to do is pay the difference and ship back the old product.

Well, that covers the desktop side of the GPGPU world, but what about server-side solutions? On the server side there are a slew of solutions based on Nvidia GPUs. Nvidia has an M1060 card that it sells to OEMs. The M1060 is designed specifically to be integrated by OEMs into server-based solutions. It is similar to the C1060 minus the "thermal solution": the M1060 relies on the server's fans to cool it. This makes the card much smaller, so you can pack more of them into a server while still providing adequate cooling. As far as I know AMD / ATI has no such animal. In theory you could drop an ATI card into any Dell or HP server that has a PCIe x16 slot in it, but I wouldn't be surprised if it overheated from time to time.

Another server-based solution that Nvidia has is its GPU 1U offload servers. The term "server" is perhaps not the best descriptive term for these devices. They do not run an OS or have any CPUs. They are basically a 1U box containing a power supply, 4 GPUs, some heat sinks, and fans. These GPU offload servers need to be connected via a PCIe x16 extension cable to a "pizza box" that actually runs an OS. Your GPGPU program runs on the "pizza box", loads your GPU kernel across the extension cable, copies your data across the cable, crunches the numbers on the GPU offload server, and then finally copies your results back (a minimal code sketch of this copy-in / compute / copy-out workflow follows below). Nvidia 1U offload server info:

Offload Servers


The S1070 is available today, but the S2050 / S2070 will not be out until sometime in mid-2010. Nvidia's "Mad Science" program applies to these devices as well. As far as I know, ATI currently has nothing comparable to these.
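Whether the GPUs sit inside the host chassis or in a 1U offload box at the far end of a PCIe extension cable, the host-side workflow is the same as the one described above: copy data to the device, run the kernel, copy the results back. Here is a minimal, hypothetical CUDA sketch of that pattern; the kernel is just a placeholder that doubles each element:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Placeholder kernel standing in for the real number crunching. */
__global__ void crunch(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);                                 /* memory on the (possibly remote) GPU */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  /* data across the PCIe link           */

    crunch<<<(n + 255) / 256, 256>>>(d_data, n);                /* kernel runs on the GPU              */

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  /* results back to the host            */
    printf("h_data[42] = %f\n", h_data[42]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}

From the program's point of view the offload server is just another CUDA device; the PCIe extension cable only changes where the copies physically travel.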

Something else that must be taken into consideration is OS support for the development toolkits (CUDA, OpenCL) provided by each vendor.


Supported OSs


Overall I'm trying to be objective and not have any bias towards AMD / ATI or Nvidia, but it is difficult. I have been doing GPGPU for a few years now and it has all been on Nvidia-based products. Once AMD / ATI has a product that I can evaluate on my OS of choice (RHEL) and starts providing server-side solutions I may change my tune… but for now… I'm sticking with Nvidia.

Monday, December 7, 2009

Intel Gives Up on Larrabee

It looks like Intel is throwing in the towel on Larrabee. After investing hundreds of millions of dollars in development costs and demoing Larrabee at SC09, Intel seems to have given up on Larrabee as a general-purpose GPU. They claim that Larrabee will still have a place in the HPC world, but they are abandoning their assault on the desktop graphics market.

When Intel embarked down the path to take on Nvidia and ATI with its Larrabee product line, many insiders wondered whether Intel had the GPU expertise to compete. After all, this isn't the first time Intel has tried to build a GPU, and it didn't meet with much success the first time around. While everyone agreed that Intel had deep enough pockets to pull it off, the main question was whether it was willing to spend what it would take to be successful. Well, it looks like we have the answer... No.

Thursday, November 5, 2009

Can Nvidia's Wile E. Coyote Kill Off IBM's Roadrunner?

The Roadrunner is an IBM BladeCenter QS22 cluster consisting of 12,960 IBM PowerXCell 8i processors and 6,480 AMD Opteron dual-core processors. In November 2008 the Roadrunner was clocked at 1.465 PFLOPS (that's pretty damn fast for a tiny bird). In fact, it's so fast that it is currently the fastest supercomputer on the planet according to http://www.top500.org . But how long will that last?

Well, according to my calculations it ain't gonna be long. I recently evaluated IBM's BladeCenter QS22. I didn't have a cluster containing 1,080 of them connected with high-speed InfiniBand switches like the Roadrunner does… I only had one. I ran a Monte Carlo option model on it and compared it to similar option models running on both an AMD 2.8 GHz Opteron (baseline) and an Nvidia Tesla C1060.

I chose a Monte Carlo model for European options because it is a widely understood model that requires significant raw computational horsepower. Additionally, each of the technologies I evaluated provides a tuned version of this model for its particular platform. Comparing model code that is tuned for each platform should ensure the best possible runtimes on every platform. I modified the tuned code slightly to ensure that it was functionally equivalent across all platforms. For comparison purposes I timed how long it took to perform a valuation of a single option with 1,000,000 random paths. I also did not include any time spent in one-time device initialization calls in the overall runtime of the valuation.

Before I tested the new platforms I established a CPU-based baseline. The baseline test was run on a dual-core 2.8 GHz AMD Opteron server with 8 GB of memory. For purposes of comparison I used a single-threaded CPU version of the model against the Cell and Tesla versions. The CPU version of the model took 1.1321 seconds to complete a single valuation. The model took 0.1260 seconds to run on the QS22 blade. On the Nvidia Tesla C1060 it took only 0.00187 seconds to run a single valuation.

The Tesla C1060 was roughly 600 times faster than the Opteron (1.1321 s vs. 0.00187 s) and about 67 times faster than the Cell processor. To be fair, Monte Carlo is a data-parallel algorithm, which is exactly what GPUs excel at. The Cell processor can also handle task-based parallel algorithms, an area where the GPU currently does not excel.
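For readers wondering what makes this map so well to a GPU: every simulated path is independent, so each thread can price its own batch of paths and the partial results are summed at the end. Below is a stripped-down CUDA sketch of that structure using the cuRAND device API. It is a generic illustration of the technique under standard Black-Scholes assumptions, not the vendor-tuned code used in the benchmark above:

#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>

/* Each thread prices PATHS_PER_THREAD independent paths of a European
 * call under Black-Scholes dynamics and writes its average discounted
 * payoff to a per-thread slot. */
#define PATHS_PER_THREAD 256

__global__ void mc_euro_call(float *partial, unsigned long long seed,
                             float S0, float K, float r, float sigma, float T)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);

    float drift = (r - 0.5f * sigma * sigma) * T;
    float vol   = sigma * sqrtf(T);
    float sum   = 0.0f;

    for (int p = 0; p < PATHS_PER_THREAD; ++p) {
        float z  = curand_normal(&state);        /* one standard normal draw per path */
        float ST = S0 * expf(drift + vol * z);   /* terminal asset price              */
        sum += fmaxf(ST - K, 0.0f);              /* call payoff                       */
    }
    partial[tid] = expf(-r * T) * sum / PATHS_PER_THREAD;
}

int main(void)
{
    const int threads = 256, blocks = 16;        /* 4096 threads x 256 paths ~ 1M paths */
    const int n = threads * blocks;

    float h_partial[threads * blocks];
    float *d_partial;
    cudaMalloc(&d_partial, n * sizeof(float));

    /* Spot 100, strike 100, 5% rate, 20% volatility, 1 year to expiry. */
    mc_euro_call<<<blocks, threads>>>(d_partial, 1234ULL,
                                      100.0f, 100.0f, 0.05f, 0.2f, 1.0f);
    cudaMemcpy(h_partial, d_partial, n * sizeof(float), cudaMemcpyDeviceToHost);

    double price = 0.0;                          /* final reduction on the CPU */
    for (int i = 0; i < n; ++i)
        price += h_partial[i];
    printf("estimated call price: %f\n", price / n);

    cudaFree(d_partial);
    return 0;
}

The paths never interact, which is why adding more GPU cores translates almost directly into shorter valuation times.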

Why did I say "currently" in the last sentence? Because Nvidia is readying the release of its next-generation chip, codenamed Fermi. The Fermi chip will have double the number of processing cores of the chip on the C1060. It will also have three times the amount of shared memory, and it supports significantly faster atomic instructions that will enable the card to handle task-based parallel algorithms much faster.
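A concrete illustration of what faster atomics buy you: with cheap atomic operations, threads can pull work items from a shared counter instead of being bound to a fixed data-parallel index, which is the basic building block of task-style parallelism on a GPU. The following is a hypothetical CUDA sketch of that dynamic work-distribution pattern, with a trivial placeholder standing in for the real task:

#include <stdio.h>
#include <cuda_runtime.h>

/* Threads claim task ids from a global counter with atomicAdd. Because
 * tasks are handed out dynamically, irregular workloads stay balanced;
 * faster atomics make the hand-out itself less of a bottleneck. */
__global__ void worker(int *next_task, int num_tasks, float *results)
{
    while (true) {
        int task = atomicAdd(next_task, 1);          /* claim the next task id */
        if (task >= num_tasks)
            break;
        /* Placeholder "task": the amount of work varies with the id. */
        float acc = 0.0f;
        for (int i = 0; i < (task % 1000) + 1; ++i)
            acc += i * 0.001f;
        results[task] = acc;
    }
}

int main(void)
{
    const int num_tasks = 100000;
    int *d_next;
    float *d_results;

    cudaMalloc(&d_next, sizeof(int));
    cudaMemset(d_next, 0, sizeof(int));
    cudaMalloc(&d_results, num_tasks * sizeof(float));

    worker<<<60, 256>>>(d_next, num_tasks, d_results);
    cudaDeviceSynchronize();

    float sample;
    cudaMemcpy(&sample, d_results + 999, sizeof(float), cudaMemcpyDeviceToHost);
    printf("results[999] = %f\n", sample);

    cudaFree(d_next);
    cudaFree(d_results);
    return 0;
}

This pattern already runs on current hardware, but the contended atomicAdd is comparatively expensive there; cheaper atomics make this kind of dynamic scheduling far more attractive.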

So why did IBM build a supercomputer out of something so blatantly inferior? My guess is that it was a last-ditch effort to spotlight the QS22 as a viable product in the HPC marketplace before the GPU vendors take over that market segment. Did IBM succeed…? Nope!

Well, they do currently occupy the top spot on the Top500 list, but the folks at Oak Ridge National Laboratory have announced plans to build their next supercomputer based on clusters of Nvidia Fermi GPUs. They claim that their system will be 10X more powerful than IBM's Roadrunner. So there you have it… While IBM's Roadrunner is top dog for now… its reign will be short-lived.