SC06 A small software start-up thinks it might have "the little blue pill" necessary to keep massive clusters up and running at all times.
Evergrid this week unveiled something called the Availability Management Suite, but you might as well call it cluster Viagra. The company’s software works to make sure that minor – or major – system failures don't hamper the overall completion of large computing jobs. In total, Evergrid helps you keep your cluster up, sturdy and strong for hours and even days at a time.
Okay, we'll stop the bad "jokes" for a moment and get to the point.
Evergrid is working off the premise that the high performance computing industry lacks the proper tools for restarting jobs on large clusters. Systems made up of thousands of boxes will have failures – plenty of them – and need a way to get jobs restarted automatically.
Today, administrators must keep a close eye on their clusters, break jobs down into different chunks or try running jobs on smaller systems in order to avoid the pains of system failures. Evergrid tries to sidestep all this mess by running continuous "checkpoints" that capture the state of servers and their applications. If a failure occurs, the system can roll back to the last known state and then get cranking away on jobs once again.
Some companies and labs have developed their own checking systems over the years, but Evergrid believes that the industry is begging for a standard here.
Thankfully, customers will not need to rewrite any of their own software to make Evergrid's code work. The Evergrid "abstraction layer" slides in between an OS kernel and its applications. The checkpoints can then gather information on the state of memory, file I/O and the network at desired intervals.
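For readers who want to see the basic idea, here is a minimal sketch of interval checkpointing with rollback. It is a toy and not Evergrid's code: the class `CheckpointedJob`, its methods and the `fail_at` parameter are all hypothetical, and it checkpoints inside the application, whereas Evergrid says its layer does the work beneath unmodified applications.

```python
import copy


class CheckpointedJob:
    """Toy job (hypothetical) that sums 1..total_steps and snapshots its state."""

    def __init__(self, total_steps, interval):
        self.total_steps = total_steps
        self.interval = interval  # take a checkpoint every `interval` steps
        self.state = {"step": 0, "accumulator": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Capture the current state as the last known good state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard work done since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)

    def run(self, fail_at=None):
        """Run to completion; simulate one failure just before step `fail_at`."""
        failed_once = False
        restarts = 0
        while self.state["step"] < self.total_steps:
            next_step = self.state["step"] + 1
            try:
                if fail_at == next_step and not failed_once:
                    failed_once = True
                    raise RuntimeError("simulated node failure")
                self.state["accumulator"] += next_step
                self.state["step"] = next_step
                if next_step % self.interval == 0:
                    self.checkpoint()
            except RuntimeError:
                self.rollback()
                restarts += 1
        return self.state["accumulator"], restarts


job = CheckpointedJob(total_steps=10, interval=3)
total, restarts = job.run(fail_at=8)
print(total, restarts)
```

In this run the simulated failure hits before step 8, so the job falls back to the checkpoint taken at step 6 and redoes step 7 instead of starting over from step 1.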
Evergrid CEO Dave Anderson, speaking here at the Supercomputing conference, told us that the monitoring software could chew through as much as 5 per cent of a system's resources. He pitched that, however, as an Armageddon-type scenario and claimed that more often than not administrators will not notice the Evergrid code.
Come January, Evergrid plans to GA its Availability Services software and Resource Manager. Together, these products form the "flagship" Availability Management Suite.
The Availability package performs the checkpoint operations, while the resource manager handles a broad array of tasks such as making sure certain jobs get a set amount of processing power and priority levels.
We caught a demo of the Suite, and it worked as billed. An Evergrid administrator had three jobs spread across an eight-server cluster. The software detected idle systems, threw them at jobs as needed and caught a failure when we ripped out one server. It stopped the one job that was affected and then got it up and running again in a couple of seconds on the available hardware.
By the second half of next year, Evergrid hopes to extend out of the HPC market and target smaller business clusters. It's looking, in particular, at the database and application server markets. Such customers might flock to Evergrid's tools for stopping jobs and then restarting them on new servers. This would allow a company to juggle different jobs with more flexibility than it has today.
For now, however, Evergrid will center on the HPC crowd that has enormous clusters built out of thousands of machines. Many HPC users deal with jobs that take days, weeks and sometimes months to process. They often have to restart these jobs from scratch due to system failures.
Evergrid has one patent and a handful more pending that it thinks will protect its IP from overzealous Linux coders who might come up with something similar. Some of you will be familiar with the company's CTO, Dr. Srinidhi Varadarajan, who built the massive G5 cluster at Virginia Tech and is credited with banging out much of the special "checking" sauce.
Evergrid claims two major customers at this point – an unnamed financial services company and the University of Oklahoma. The company has yet to set official pricing for its software, although Anderson guessed it will come in around "$250 per node with large volume discounts."
There's more information available here. ®