Over half of the top HPC centres in the world are using the Lustre file system, so the latest move to fund its development should make a lot of people happy.
And there will be a lot of brainpower and resources backing up the move: namely, supercomputer maker Cray, HPC storage vendor Data Direct Networks, and two nuke labs funded by the US Department of Energy - the Lawrence Livermore National Laboratory and Oak Ridge National Laboratory.
Cray, Data Direct Networks and the Lawrence Livermore and Oak Ridge labs have pooled resources to create Open Scalable File Systems. OpenSFS, a non-profit based in California, will fund the development of the Lustre parallel file system.
The Lustre file system is an open source project under the control of Oracle, if any open source project can be said to be under anyone's control. Of the top 100 HPC centres in the world, 60 of them are using the Lustre file system to feed data to their supers, according to Galen Shipman, technology integration group leader at Oak Ridge and a board member of OpenSFS.
The problem is not that Oracle has stopped developing Lustre - the company put the finishing touches on Lustre 2.0 back in August - but rather that Oracle is not as interested in the HPC market as Sun Microsystems was, and it has no intention of offering commercial-grade support for Lustre 2.0, although it is still offering support for the earlier Lustre 1.8 release. This is a big problem for HPC labs.
You might think that the big nuke labs already have the smartest people in the world working for them and that, if anything, it should be they who would offer tech support to the rest of the HPC community for products such as Lustre. I certainly thought that, based on the raw brainpower at these labs. Also, Lawrence Livermore was where a lot of the original Lustre file system work was funded out of DOE and put into production to be put through its paces. But Mark Seager, Lawrence Livermore's assistant department head for advanced technology - and an OpenSFS board member - said this is not what Lustre customers want.
While the hotshot HPC shops using Lustre are able to handle level one tech support fine, and can even wade in and offer level two support, providing bug fixes for the easy stuff, that's about as far as it goes.
"We need that third level of support backing us up for deep problems," explains Seager. "We do not have the manpower to do that deep support."
The need for deep and official support, and a product roadmap and development process that could accept the input and requirements from all the customers and shape future releases to address their needs, was why Peter Braam, a researcher at Carnegie Mellon University and founder of the Lustre project, created Cluster File Systems in 2001. Sun Microsystems bought CFS in 2007, and Oracle ate Sun in January.
The situation around Lustre was murky enough that Whamcloud, a startup founded with $10m in private equity funding and some of the top people involved with the development of Lustre, burst on the scene at the end of July to chase Lustre support contracts and do development on the parallel file system and submit code back into the Oracle-controlled Lustre code base.
Morse not interested in forking the code
Brent Gorda, Whamcloud's chief executive officer and the DOE administrator who was cutting the cheques for the development of Lustre many years ago, was adamant back in July that Whamcloud did not want to fork the code-base. And now Norman Morse, the CEO at OpenSFS and formerly the data centre manager at Los Alamos National Laboratory (another DOE nuke lab), is similarly adamant that OpenSFS is not going to fork the Lustre code either.
"The focus of OpenSFS is complementary to what Oracle is doing, with their focus on Solaris and their own hardware," Morse explains. "Our focus is on Linux and HPC workloads. We definitely want to cooperate with Oracle. We are not intending to fork the code."
That said, the big HPC labs that may not be Oracle customers for much longer still need tech support. This is where Whamcloud, Xyratex, Cray, and DDN come in. They will offer support for Lustre file systems (and maybe even IBM and Hewlett-Packard will do so if the mood strikes them now that Oracle is backing away from traditional HPC system sales). OpenSFS wants to fund the companies that are offering that deep support to make sure someone is doing it. The non-profit also wants to fund the development of features to stabilise and then scale Lustre, as well as fund whatever future file system will be created to store data on exascale-class systems many years hence.
"There is a strong likelihood that the POSIX interface can't scale to systems of that size," says Galen Shipman, group leader of technology integration at Oak Ridge National Laboratory. "We'll focus on Lustre initially, but it could be something else in the long run." And that is why the organisation is not called OpenLustre.
At the moment, Oak Ridge has a contract with Cray for supporting its 13 PB of Lustre file systems. Lawrence Livermore was a CFS, then a Sun, and then an Oracle customer and is now looking to shop out support contracts once it moves off the Lustre 1.8 release that Oracle is now supporting. Seager says that part of the $2m in seed money that the four founding members have put into the kitty to fund OpenSFS is for a support contract for Lawrence Livermore. And it is not a foregone conclusion that this support contract will go to Whamcloud, either, since it will be put out for bidding.
Morse says that OpenSFS will not have any programmers of its own, but will instead prioritise requirements coming out of the Lustre community and then fund development of features by paying third parties to do the work under contract. These could be individual developers or companies such as Whamcloud, Xyratex, Cray, DDN or Oracle. Whatever code is developed will be handed back to Oracle, and it is hoped that this code will be accepted by Oracle for inclusion in the Lustre tree.
Both Whamcloud and OpenSFS have similar hopes of cooperation with Oracle. But what happens if Oracle doesn't want to play it that way? At a certain point, the users of Lustre, through OpenSFS if they all join up, will have no choice but to fork the code. Indeed, if all of the Lustre customers worldwide stand behind OpenSFS, endorse features, and fund their development, then it will be more accurate to say that whatever Oracle has in its Lustre implementation is the real fork.
I think it is safe to back the ones with the nukes over Larry Ellison if this should become a fight. ®