SC09 In the supercomputer racket, you can be a niche player, a volume server maker with some HPC smarts, or a carcass that one of the other two feasts on.
As a boutique supplier of high performance parallel clusters, Appro International has to use every weapon it can get its hands on to distinguish itself from IBM, Cray, and Silicon Graphics. And it has to find an edge and make sales to prevent itself from becoming a carcass like the original SGI (eaten by Rackable Systems and renamed Silicon Graphics), the original Cray (eaten by Tera Computer and renamed Cray), Linux Networx (eaten by SGI before it was eaten by Rackable), Thinking Machines, Kendall Square, and Cray Research (which Sun Microsystems ate with nary a burp).
John Lee, vice president of advanced technology solutions at Appro, makes no bones about it. "To be a leader in HPC, we have to take advantage of bleeding-edge technology," says Lee. "And we believe that GPU will take off in HPC next year," he adds, referring to the graphics processing unit co-processors that Nvidia and Advanced Micro Devices have been selling and Intel is hoping to get to market sometime before too long.
"It is now like having a third viable candidate. We have always had Intel and AMD for running code, but now Nvidia is going to be out there with Fermi."
The Fermi kickers to the current Tesla GPU co-processors were detailed in early October by Nvidia and have already been tapped by Oak Ridge National Laboratory, one of the US Department of Energy super centers, to be a part of a next-generation cluster it plans to build.
Unlike the Tesla GPUs, which have crap performance on double precision floating point math, the Fermi GPUs will deliver around 500 gigaflops each at double precision. This is enough for Oak Ridge to be talking about building a 10 petaflops hybrid super.
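To put those two numbers side by side, here is a back-of-envelope sketch in Python. It assumes, purely for illustration, that the GPUs alone supply the whole 10 petaflops and that the roughly 500 gigaflops figure is a per-GPU peak at double precision. Neither assumption comes from Oak Ridge, and a real hybrid machine would also count its CPUs.

```python
# Rough count of GPUs needed if GPUs alone supplied the quoted peak.
# Assumption (not from Oak Ridge): CPU flops are ignored and the
# per-GPU figure is treated as a peak double precision number.
PETA = 10**15
GIGA = 10**9

target_flops = 10 * PETA      # the 10 petaflops hybrid super
per_gpu_flops = 500 * GIGA    # approximate Fermi figure quoted above

gpus_needed = target_flops // per_gpu_flops
print(gpus_needed)  # 20000
```

On those assumptions, the machine would need something on the order of 20,000 Fermi GPUs.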
But according to Lee, the double precision math is not the only breakthrough coming with the Fermi GPUs. Error correction is key, and something that has been missing from all GPUs to date. "If bits flip here and there, what is the point of having a machine run very fast if you can't trust the answer you are going to get?" Lee asks rhetorically. "No serious HPC center will touch these GPUs until there is error correction."
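Lee's worry about flipped bits is easy to illustrate. The Python sketch below is not GPU code and says nothing about how Fermi implements error correction. It simply inverts one bit in the 64-bit IEEE 754 encoding of a double precision number on an ordinary host, using a hypothetical helper named `flip_bit`, to show how far a single uncorrected flip can move a value.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its 64-bit IEEE 754 encoding inverted.

    Bit 0 is the least significant mantissa bit and bit 63 is the sign bit.
    Hypothetical helper for illustration only.
    """
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped

x = 1.0
print(flip_bit(x, 0))   # lowest mantissa bit: 1.0000000000000002
print(flip_bit(x, 62))  # top exponent bit: inf
print(flip_bit(x, 63))  # sign bit: -1.0
```

A flip in the low mantissa bits is nearly invisible, while a flip in the exponent or sign silently turns a sane value into garbage, which is the kind of error that memory without error correction cannot catch.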
Appro already bundles Tesla GPUs in its HyperPower clusters, which can deliver 304 x64 processor cores and 18,240 GPU cores for a machine that yields 6.56 teraflops per rack in double precision floating point and 78 teraflops per rack in single precision. But very few applications can make use of single precision in the HPC space, and without error correction, that's two strikes against the current crop of GPUs.
When the Fermi GPUs from Nvidia are ready - the word on the street is that Nvidia will start shipping them in the first half of 2010 - Appro says it will bundle them into its HyperPower machines as well as its Xtreme-X1 high-end supercomputers, which currently do not support GPUs. (The Xtreme-X1 line is also being rigged with Intel flash memory and future "Sandy Bridge" Xeons for the "Gordon" super at the San Diego Supercomputer Center, a $20m deal Appro announced last week for delivery in 2011.)
As for IBM's Cell co-processors (which are not GPUs but which have multiple extra processing units wrapped around a Power core that provide similar math power) and Intel's future "Larrabee" x64-compatible GPUs, Lee doesn't have much enthusiasm for them. "We believe that Cell is on its way out," says Lee. "With what Nvidia has done, Cell will have a very short life. When we talk about GPU computing, Nvidia is really the only viable player - not just because of Fermi, but because of CUDA."
CUDA is the parallel programming environment that allows C programs to call the GPU to feed it math. It still needs C++ and Fortran hooks, by the way, but hopefully these will be ready with Fermi. If AMD gets error correction on its Firestream GPUs, there's a chance for AMD to step up and compete, and Lee is not silly enough to count out Intel's Larrabee entirely. "As Intel has shown with the Nehalem Xeons, when it focuses, it can deliver." All that said, Lee believes Nvidia has a 24 month lead in GPUs, which is forever in the supercomputing space.
That's a little bit harsh on the Cell chip. It has delivered better double-precision floating point performance than Nvidia's Teslas, which have been in the market for two years, and it is used in the second-most powerful supercomputer in the world, the 1 petaflops "Roadrunner" hybrid Opteron-Cell super at Los Alamos National Laboratory. IBM's two-socket QS22 blade server delivers 460 gigaflops of single-precision and 217 gigaflops of double-precision math.
Big Blue competition
According to roadmaps that El Reg has seen, IBM is supposed to be cooking up a QS2Z blade, which is supposed to have two Cell chips that in turn have two Power cores and a whopping 32 vector processors each, using a next-generation memory and interconnection technology. This QS2Z blade would sport 2 teraflops per blade at single precision and 1 teraflops per blade at double precision.
This blade could, in theory, compete with the Fermi GPUs. But probably not at anything close to the same price. Which is probably why Oak Ridge went with Fermi GPUs. (It is not clear when or if this QS2Z blade from IBM will come to market, but it was supposed to be in the first half of 2010.)
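The arithmetic behind that comparison can be sketched in a few lines of Python, using only the figures quoted in this article. Two caveats apply: the QS2Z numbers are roadmap figures rather than measurements, and the roughly 500 gigaflops Fermi figure is read here as a per-GPU double precision number.

```python
# Figures quoted in the article, in gigaflops per blade (per GPU for Fermi).
qs22 = {"single": 460, "double": 217}    # shipping two-socket Cell blade
qs2z = {"single": 2000, "double": 1000}  # roadmap blade, not a shipping product
fermi_double = 500                       # approximate, assumed double precision

# Generation-on-generation gain for the Cell blades.
for precision in ("single", "double"):
    ratio = qs2z[precision] / qs22[precision]
    print(f"{precision}: QS2Z is about {ratio:.1f}x QS22")

# How many Fermi GPUs match one roadmap blade at double precision.
fermi_equivalents = qs2z["double"] / fermi_double
print(f"one QS2Z blade ~ {fermi_equivalents:.0f} Fermi GPUs at double precision")
```

By these numbers, the roadmap blade would be a bit more than four times the QS22 and roughly on par with a pair of Fermi GPUs at double precision, which leaves price as the deciding factor.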
Even with its substantial lead in GPU co-processing, the physical form factor of Nvidia's GPUs is still going to present HPC vendors with one challenge: integrating the GPUs into their servers. It's not like server motherboards have multiple GPU sockets that allow them to be snapped right into the system board. They are still linking in through PCI-Express ports, and they are still hot and cannot be densely packed into clusters.
The other thing that Appro will be previewing at SC09 in Portland this week is its future Opteron server lineup, and Lee is calling the "Magny-Cours" twelve-core Opterons, due in the first quarter of 2010, AMD's "comeback play." Appro will do a complete product refresh of its Opteron super line with the Magny-Cours chips, now called the Opteron 6100s, and their G34 socket. "For many HPC codes, customers will still need a general purpose CPU," explains Lee. "It has been a tough year for AMD, but with the G34 processors, we think it will start to come back."
AMD started touting the impressive memory bandwidth of the Opteron 6100-G34 systems last week, showing that a four-socket box will be able to deliver around 100 GB/sec of memory bandwidth on the Stream benchmark test.
At the moment, Appro doesn't have much use for the Opteron 4100s and their C32 sockets, which are a variant of the current Rev F 1,207-pin sockets, as AMD divulged last week. This low-end, low-power Opteron 4100 chip, which comes with six-core and eight-core processors, could be used as a host for multiple Fermi GPUs, Lee concedes. "If you are buying a GPU system, a low-cost host makes sense." But for real HPC work, Lee says the Opteron 4100s don't have enough cores or memory bandwidth to be practical. ®