Yes, Nehalem is fast.
If you want to see the dramatic effect that Intel's move to the QuickPath Interconnect and integrated memory controllers has had on performance with its "Nehalem EP" Xeon 5500 processors, take a gander at the first results on the TPC-C online transaction processing benchmark test.
Hewlett-Packard's two-socket ProLiant DL370 G6 server is the first Nehalem EP box to be put through the TPC-C paces, and it blows away more expensive four-socket ProLiant machines using Advanced Micro Devices' "Shanghai" quad-core Opterons and even gives four-socket machines using Intel's own "Dunnington" six-core Xeon 7400s a run for their money.
To say that the old Xeon DP and Xeon MP processors and their frontside bus architecture were memory constrained is an understatement. And the Nehalem EPs not only put a whole box of nails in the coffin for two-socket Xeon DP servers, but they are sizing up the Dunnington Xeons for their coffin, even before the octo-core "Nehalem EX" chips for four-socket and larger servers arrive either later this year or early next year. For a lot of workloads, a two-socket Nehalem EP does the same work as a Dunnington server, which costs a lot more money.
On the TPC-C test run by HP, the ProLiant DL370 G6 was configured with two quad-core Xeon X5570 processors running at 2.93 GHz. That's a total of eight cores, but equally importantly, that's 16 processor threads for running applications. The DL370 was set up with the maximum memory possible on the box, which is 144 GB using 8 GB DDR3 DIMMs. Among other things, the TPC-C test measures how much disk I/O is needed to saturate the processors running a mix of OLTP applications that simulate the operations of a warehouse (a real one, with forklifts, not a data warehouse).
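For readers who want to check the memory arithmetic, here is a minimal sketch that works through the configuration described above. The constants come from the article; the even split of DIMMs across the two sockets and the variable names are assumptions made for illustration.

```python
# Figures quoted in the article: 144 GB built from 8 GB DIMMs,
# two sockets, four cores per socket, two threads per core.
TOTAL_MEMORY_GB = 144
DIMM_SIZE_GB = 8
SOCKETS = 2
CORES_PER_SOCKET = 4
THREADS_PER_CORE = 2

dimm_count = TOTAL_MEMORY_GB // DIMM_SIZE_GB
# Assumption: the DIMMs are spread evenly across both sockets.
dimms_per_socket = dimm_count // SOCKETS

cores = SOCKETS * CORES_PER_SOCKET
threads = cores * THREADS_PER_CORE

gb_per_core = TOTAL_MEMORY_GB / cores
gb_per_thread = TOTAL_MEMORY_GB / threads

print(f"{dimm_count} DIMMs, {dimms_per_socket} per socket")
print(f"{cores} cores, {threads} threads")
print(f"{gb_per_core:.0f} GB per core, {gb_per_thread:.0f} GB per thread")
```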
The test measures how many new orders the warehouse can process as a bunch of other transactions are being run at the same time. With the large amount of memory on two-socket servers (at least compared to when the TPC-C test came out in 1993), it takes a lot of disk drives to saturate the system as transactions are running. The ProLiant DL370 tested by HP was equipped with four SAS RAID disk controllers inside the server chassis, each with six disks to store log and OS image data. HP then added forty MSA70 disk enclosures (with 25 36 GB, 15K RPM drives each), nine MSA2324fc Fibre Channel arrays (with 23 of the same drives each), and a few more drives thrown in for good measure, for a total of 1,210 disks and 60 TB of disk capacity.
It took eight of HP's DL360 G5 servers to simulate the 500,000 end users driving the system. The box ran Oracle Enterprise Linux (Oracle's clone of Red Hat's Enterprise Linux 5) to cut costs, as well as Oracle's 11g Standard Edition One database, and was able to process 631,766 TPC-C transactions per minute (TPM). The hardware in the setup, which was mostly the disk storage, cost $666,040, and three years of maintenance on the system cost $69,910. The software cost a mere $5,800, plus $10,497 for maintenance. Adding in the client hardware and software (which you have to do as part of the TPC-C test) pushed the price tag on this two-socket system to $802,683, but a 15.5 per cent discount brought the system down to $1.08 per TPM.
Back in November 2008, HP tested a four-socket DL585 G5 server using AMD's Shanghai Opteron 8384 processors, which have four cores running at 2.7 GHz, for a total of 16 cores and 16 threads. Now, this DL585 G5 motherboard has 32 memory slots, but they only take DDR2 main memory, which runs slower than the DDR3 memory used with the Nehalems. This box had 256 GB of memory and a very peppy HyperTransport interconnect (which Intel's QPI basically copies). But AMD doesn't have simultaneous multithreading on the Opterons (which is dumb at this point), so even with twice as many sockets, it had the same 16 threads to run the database behind the TPC-C test.
Plus, the higher memory bandwidth of the two-socket Nehalem box allows it to best the four-socket Shanghai machine on the TPC-C test. The Opteron server had 732 disk drives (27.8 TB) and was able to process 579,814 TPM at a cost of 96 cents per TPM. The Opteron machine was running Windows Server 2003 and SQL Server 2005 (both at the R2 Enterprise x64 Edition SP2 level), so this might account for some of the performance difference. (This machine had a 16 per cent discount off list price for the hardware, software, and maintenance.)
HP also tested a DL580 G5 server using Intel's six-core Dunnington chips back in January, running the same Oracle Linux and 11g database setup. The Dunnington box, which used four of the six-core Xeon X7460 processors running at 2.67 GHz for a total of 24 cores, was able to process 639,253 TPM at a cost of 97 cents per TPM. Like the Opterons, the Dunnington chips do not support simultaneous multithreading (what Intel brands HyperThreading and which it smartly put into the Nehalem chips despite the extra transistors it requires), so 24 cores means 24 threads. That Dunnington machine tested by HP had 256 GB of memory, eleven disk controllers, and 1,052 disk drives (43.4 TB of capacity).
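One rough way to line up the three HP results is to divide throughput by hardware thread count. The sketch below does that with the TPM and thread figures quoted in this article. The per-thread metric and the names used are illustrative assumptions, not part of the TPC-C specification, and the machines ran different software stacks, so treat the output as a back-of-the-envelope comparison only.

```python
# (TPM, hardware threads) as quoted in the article for each HP system.
results = {
    "DL370 G6, 2x Xeon X5570 (Nehalem EP)": (631_766, 16),
    "DL585 G5, 4x Opteron 8384 (Shanghai)": (579_814, 16),
    "DL580 G5, 4x Xeon X7460 (Dunnington)": (639_253, 24),
}


def tpm_per_thread(tpm: int, threads: int) -> float:
    """Throughput divided by hardware threads (an illustrative metric)."""
    return tpm / threads


per_thread = {
    name: tpm_per_thread(tpm, threads)
    for name, (tpm, threads) in results.items()
}

# Print the systems from highest to lowest per-thread throughput.
for name, value in sorted(per_thread.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:,.0f} TPM per thread")
```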
Here's the important bit: the DL580 G5 iron is a lot more expensive ($59,740 for the basic server, processors, and memory compared to $22,162 for the Nehalem EP-based DL370 G6 server) and because it is a four-socket box, it has to run the more expensive Oracle 11g Standard Edition (which costs $41,900 on the Dunnington box, compared to $12,700 for Standard Edition One, which is only available on two-socket boxes). The Nehalem EP box has more than half as much main memory (144 GB versus 256 GB) and two-thirds as many execution threads (16 versus 24), runs cheaper database software, does nearly as much work, and costs about a third as much for the basic system - server, operating system, and database - not including the ridiculous amount of storage it takes to drive the TPC-C test.
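The "about a third" figure can be checked with a quick calculation. This is a minimal sketch using only the server and database prices quoted above; the operating system cost is left out because the article does not break it out, and the dictionary and function names are made up for illustration.

```python
# Prices in US dollars as quoted in the article.
# Assumption: operating system cost is omitted because no figure is given.
nehalem_ep = {"server": 22_162, "database": 12_700}
dunnington = {"server": 59_740, "database": 41_900}


def basic_system_cost(parts: dict) -> int:
    """Sum the quoted component prices for one system."""
    return sum(parts.values())


nehalem_cost = basic_system_cost(nehalem_ep)
dunnington_cost = basic_system_cost(dunnington)
ratio = nehalem_cost / dunnington_cost

print(f"Nehalem EP basic system: ${nehalem_cost:,}")
print(f"Dunnington basic system: ${dunnington_cost:,}")
print(f"Nehalem EP costs {ratio:.0%} of the Dunnington system")
```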
Not that Dunnington machines have no place. IBM's four-socket System x3850 M2 server offers similar performance, at 684,508 TPM using the six-core Dunningtons (running the Windows 2003 and SQL Server 2005 combo), but costs a ridiculous $2.58 per TPM (even after a 34 per cent discount) because IBM charges too much for main memory and disk arrays. However, IBM doesn't stop at four sockets like other Dunnington machines, and its System x3950 M2 machine, which basically lashes two x3850s into a single NUMA cluster, can drive 1.2 million TPM at a cost of $1.99 per TPM after discounts. Again, the bulk of the cost of these machines is storage - the big one here has 143.3 TB of capacity across 1,931 disks, and those disks are necessary to drive the I/O behind the database transactions embodied by the TPC-C test, not for capacity.
The wonder, of course, is why HP and IBM didn't slap solid state storage in their machines and really bring down the disk drive count, and therefore the price. They'll figure this out sooner or later, thanks to the economic meltdown.
As for Xeon MP machines, don't expect a lot of traction until Intel delivers the Nehalem EX boxes. And the Nehalem EX machines are going to have to do a lot better than nine DIMM slots per socket, or a maximum of 288 GB of main memory, to impress a lot of data centers. There is some confusion as to whether the Nehalem EX machines will support FB-DIMM or DDR3 main memory, and how many channels will come out of the sockets. With 32 cores and 64 threads in a four-socket image, main memory really needs to be something closer to 576 GB on the Nehalem EX machines. That's the same 18 GB per core that the Nehalem EP gets in big configurations. ®