EMC is betting big on big data analytics and has integrated the Hadoop filesystem into its Isilon scale-out filer offering and enabled its Greenplum analytics product to use Hadoop data.
Hadoop's filesystem (HDFS) is an object-style, distributed and scalable open source filesystem implemented across a cluster of DataNodes and a single NameNode, with a secondary NameNode in larger clusters that snapshots the primary NameNode's data structures and can be used as a rebuild resource if the primary NameNode fails. The NameNode holds metadata about the files stored in the DataNodes, which serve them up on request.
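The split between metadata and data described above can be sketched in a few lines of Python. This is a toy model only, not Hadoop code: the class names, the round-robin block placement, the four-byte block size and the default of three copies are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Stores block contents, keyed by block id (toy model)."""
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Stores metadata only: file name -> list of (block id, holder indexes)."""
    files: dict = field(default_factory=dict)


class ToyCluster:
    """Hypothetical miniature of one NameNode fronting several DataNodes."""

    def __init__(self, n_datanodes, copies=3):
        self.namenode = NameNode()
        self.datanodes = [DataNode() for _ in range(n_datanodes)]
        self.copies = min(copies, n_datanodes)

    def write(self, name, data, block_size=4):
        entries = []
        for offset in range(0, len(data), block_size):
            index = offset // block_size
            block_id = f"{name}#{index}"
            # Assumed placement policy: consecutive nodes, round-robin.
            holders = [(index + k) % len(self.datanodes)
                       for k in range(self.copies)]
            for h in holders:
                self.datanodes[h].blocks[block_id] = data[offset:offset + block_size]
            entries.append((block_id, holders))
        self.namenode.files[name] = entries

    def read(self, name):
        # Ask the NameNode where the blocks live, then fetch each one
        # from the first DataNode listed as holding a copy.
        return b"".join(self.datanodes[holders[0]].blocks[block_id]
                        for block_id, holders in self.namenode.files[name])


cluster = ToyCluster(n_datanodes=4)
cluster.write("genome.dat", b"hello world")
print(cluster.read("genome.dat"))
```

Every read goes through the NameNode's table first; the DataNodes hold the bytes but have no record of which file a block belongs to.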
HDFS is popular today in universities, especially in life sciences, as well as for some Web 2.0 applications. Part of EMC's pitch is that the NameNode is a single point of failure with no high availability, which, it claims, effectively rules HDFS out for enterprise data centres. The company reckons that there is a large opportunity to provide Hadoop systems for big data analytics in corporate data centres if HDFS can be made usable in the enterprise sense and manageable by ordinary storage admins. That's what it's doing by providing an integrated Isilon-HDFS storage back-end for a Greenplum HD analytics front-end.
With the Isilon OneFS v6.5 release, EMC has provided a one-stop Apache Hadoop shop and what it sees as missing facilities in the Hadoop world, namely:
- a sharable instead of a dedicated storage infrastructure;
- high availability for the NameNode;
- protection through snapshots (SnapshotIQ), replication (SyncIQ) and backup (NDMP);
- improved storage efficiency beyond the 3X data mirroring of basic HDFS to the 80 per cent level;
- ability to scale compute and capacity separately; and
- automated data import/export via NFS, CIFS, FTP, and HTTP.
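The storage-efficiency item in the list is simple arithmetic, sketched below. Keeping three full copies of every block leaves one third of raw capacity for unique data; the 80 per cent figure is EMC's claim, taken as given. The function names and the 100TB data set are illustrative assumptions.

```python
def usable_fraction(copies):
    """Fraction of raw capacity holding unique data when every block is stored `copies` times."""
    return 1 / copies


def raw_capacity_needed(data_tb, efficiency):
    """Raw terabytes required to hold `data_tb` of unique data at a given efficiency."""
    return data_tb / efficiency


hdfs_efficiency = usable_fraction(3)   # 3X mirroring in basic HDFS
emc_efficiency = 0.80                  # EMC's claimed level, not verified here

data_tb = 100                          # hypothetical data set size
print(f"3X mirroring: {hdfs_efficiency:.0%} usable, "
      f"{raw_capacity_needed(data_tb, hdfs_efficiency):.0f}TB raw for {data_tb}TB of data")
print(f"EMC's claim:  {emc_efficiency:.0%} usable, "
      f"{raw_capacity_needed(data_tb, emc_efficiency):.0f}TB raw for {data_tb}TB of data")
```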
Nick Kirsch, Isilon's director of product management, said of the NameNode implementation: "This is unique. The NameNode is now part of our distributed metadata. Every node is a NameNode."
Next, Greenplum has certified Apache Hadoop and provided platform management and control, along with parallel analytics access with the Greenplum database. EMC is also providing design and training services, 24x7 support around the world and a roadmap for development.
EMC contrasts its approach with that of Oracle and NetApp, neither of whom, EMC claims, can provide Hadoop natively integrated with their storage arrays; full HA for the NameNode; the same level of storage efficiency; multi-protocol access; and corporate-level protection features.
Purdue University has tried out the Isilon/Hadoop combo in its statistics department and has endorsed it, saying that there is now no need for a separate Hadoop data silo and that its users now have "a single, shared storage resource for data computing and analytics". Its statisticians can do more statistics and less Hadoop infrastructure management.
EMC claims these added features will make Hadoop more usable by enterprises and also that enterprise Hadoop users will increasingly look to data scientists (see Wikibon description) to statistically analyse their big data sets for meaningful – and monetisable – information. After all, the ability to monetise the crunched data is the big data pay-off.
EMC Greenplum HD on Isilon is available immediately through EMC and its channel partners. ®