Permabit has taken issue with our "Isilon and a question of Big Data" story, insisting that its Albireo deduplication technology can be applied to scale-out file data storage and deliver a tenfold reduction in storage costs with no performance penalty.
We had previously asked Isilon's chief technology officer for the Americas, Rob Peglar, "Would scale-out filers benefit from having deduplicated files, assuming that did not reduce performance?"
He said that deduplication adversely affects storage performance and costs with Big Data and shouldn't be used. Here are his points, each followed by Permabit's detailed rebuttal:
Rob Peglar: In general, the answer is no. First, the assumption is incorrect; any data reduction technique of the three known (compression, deduplication, incrementalisation) has a performance (time) implication. In addition, deduplication also has a space implication; the tradeoff is metadata versus data.
At large scale, deduplication metadata becomes very significant. For example, holding hashes (CRCs) of each 4KB of data – a very common granularity – implies four trillion items of metadata for a data repository of small size, 4PB.
Permabit: Rob, we respectfully disagree with just about everything you have stated, as does the rest of the industry. Your comment may be true about legacy (backup) deduplication solutions but it's not the case with Permabit Albireo.
You're basically comparing your father's Buick with today's Hybrid technology!
The metadata for deduplication grows as the unique data being stored grows, so on a many petabyte system you require disk for the deduplication metadata. This is true, but irrelevant... in a system such as Albireo, the percentage of overhead is a bit over 1 per cent of the disk for 4K blocks. So, if you reduce space by 5-10X, the gain is huge.
Rob Peglar: If each hash structure (CRC & disk pointer, ie, given a hash, where is its data?) is only 64 bits, or 8 bytes, which is quite small, this means 32TB of hash metadata which must be completely consistent across all nodes at all times.
Permabit: Albireo is orders of magnitude more efficient than this example. Albireo only requires 0.1 bytes of RAM per block of data indexed. So, in your example of 4PB of data, Albireo requires just 100GB of hash metadata, NOT 32TB.
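The figures traded in this exchange can be checked with some simple arithmetic. The sketch below is only a back-of-envelope calculation, assuming decimal units (1PB = 10^15 bytes) and taking the 8-byte and 0.1-byte per-entry sizes exactly as quoted; the function and variable names are ours. It shows that 32TB follows from four trillion entries at 8 bytes each, while 100GB follows from 0.1 bytes per entry applied to the one trillion blocks you get by dividing 4PB by 4KB.

```python
# Back-of-envelope check of the index sizes quoted by both sides.
# Assumption: decimal units, as storage capacities are usually quoted.
KB, GB, TB, PB = 10**3, 10**9, 10**12, 10**15


def block_count(repository_bytes, block_bytes):
    """Number of fixed-size blocks needed to cover the repository."""
    return repository_bytes // block_bytes


def index_bytes(entries, bytes_per_entry):
    """Total index size for a given number of entries."""
    return entries * bytes_per_entry


blocks_4k = block_count(4 * PB, 4 * KB)  # 4PB split into 4KB blocks
quoted_items = 4 * 10**12                # "four trillion items", as quoted

for label, entries in (("4PB in 4KB blocks", blocks_4k),
                       ("four trillion items", quoted_items)):
    full = index_bytes(entries, 8) / TB       # 8-byte hash structure
    compact = index_bytes(entries, 0.1) / GB  # 0.1 bytes per block
    print(f"{label}: {entries:.0e} entries, "
          f"8 B/entry = {full:.0f} TB, 0.1 B/entry = {compact:.0f} GB")
```

On these assumptions the two sides appear to be working from different entry counts, which accounts for part of the gap between their numbers.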
Rob Peglar: One must not only store that 32TB of data in stable and protected storage, but that storage must also be very fast, nearly as fast as the CPU’s ability to ingest it. It's cost-prohibitive to have each node with 32TB of RAM just to hold hashes.
Permabit: As stated above, this is just "old" math. We keep the hash process in RAM, which runs at processor speeds, and enable the performance needed to scale.
Rob Peglar: Plus, even if you did have 32TB of RAM, it also means the CPUs in each node having to read 16TB worth of metadata (in the worst case) for each and every write access to a file, no matter how small – to perform the dedupe hash check – and that searching alone is non-trivial, taking significant time.
Permabit: Again, because we made fundamental advances in indexing technologies, hash checks take less than 10 microseconds on average and, because of our patented Delta Index technology, the speed actually increases and the process becomes more efficient as the system stores more signatures.
Albireo Grid, with GX technology, allows for scalability across nodes. Albireo Grid consists of multiple Albireo clients, embedded in each OEM's storage software, connected to multiple Albireo Grid servers over the network.
In an Albireo Grid deployment scenario, each Albireo client performs its own content-level segmentation and hashing, then distributes the hashes across the Albireo Grid servers and load balances the grid. This is another performance and efficiency capability that Albireo delivers.
As the hashes are processed the server returns deduplication advice to the client and stores the hash key if it is unique.
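The client/server flow described here can be illustrated with a toy model. This is a minimal sketch and not Permabit's implementation or API: the class names `ToyIndexServer` and `ToyClient` are hypothetical, fixed 4KB chunks stand in for whatever content-level segmentation Albireo actually performs, SHA-256 stands in for its hash, and routing by the leading bytes of the digest is just one simple way to spread hashes across servers.

```python
import hashlib


class ToyIndexServer:
    """Hypothetical index server: remembers every hash it has been shown."""

    def __init__(self):
        self.seen = set()

    def advise(self, digest):
        """Return True if the hash is a duplicate; store it if it is new."""
        if digest in self.seen:
            return True
        self.seen.add(digest)
        return False


class ToyClient:
    """Hypothetical client: chunks data, hashes it, asks a server for advice."""

    def __init__(self, servers, chunk_size=4096):
        self.servers = servers
        self.chunk_size = chunk_size

    def _route(self, digest):
        # Spread hashes across servers using the leading bytes of the digest.
        return self.servers[int.from_bytes(digest[:8], "big") % len(self.servers)]

    def write(self, data):
        """Return (unique_chunks, duplicate_chunks) for this write."""
        unique = duplicate = 0
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            if self._route(digest).advise(digest):
                duplicate += 1
            else:
                unique += 1
        return unique, duplicate


servers = [ToyIndexServer() for _ in range(4)]
client = ToyClient(servers)
payload = b"A" * 4096 * 3 + b"B" * 4096  # three identical chunks, one different
print(client.write(payload))
```

Because a given hash always routes to the same server, a second client sharing the same servers would see every chunk of `payload` reported as a duplicate. Only hashes cross the network in this model; the data itself stays with the client.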
Rob Peglar: The basic problem is that data is growing faster than the CPU’s ability to efficiently process the metadata for it. This is why at scale, deduplication is a non-optimal technique.
Permabit: This just isn't true with Albireo. As stated above, Albireo indexing is highly efficient, enabling hash lookups to be an insignificant load on the processors.
With the industry-leading and patented indexing, hashing and memory management employed in Albireo, the limitations previously found in legacy dedupe have been overcome. These achievements enable Albireo to scale out limitlessly and yet perform at industry record-breaking speeds, making deduplication not only viable in today’s multi-core, processor-rich environments, but enabling vendors to finally deploy dedupe without compromise.
Rob Peglar: It may save some "end" space, but consider "Big Data" as discussed before. This data is often highly unique and rarely can be deduplicated. For example, web hits and traffic from end users. Each end user is unique – by definition – and must be able to be identified as such to analytic software. Each hit is at a different point in time, always changing, always incrementing.
Constant streams of new data being ingested are therefore rarely duplicated across the user universe. So, for "Big Data", deduplication is most often a bad trade-off – even if the CPU were infinitely fast, you wouldn't save much space on disk. Contrast this with traditional VM or VDI usage, where OS images are immutable and mostly read-only; here, deduplication is a good trade-off. But that is not the problem the industry is trying to solve now. The problem is ingestion, analysis and long-term storage and protection of massive (and growing!) amounts of unique and ever-changing data.
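The contrast Peglar draws can be illustrated with synthetic data. The sketch below is an illustration of the argument and not a measurement of any real workload: the log format, the record count, the 1MiB "image" and the 20 clones are all invented for the example, and fixed 4KB chunking with SHA-256 is an assumption rather than any vendor's method.

```python
import hashlib
import random


def dedupe_ratio(data, chunk_size=4096):
    """Logical chunks divided by unique chunks; 1.0 means nothing deduplicated."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha256(chunk).digest() for chunk in chunks}
    return len(chunks) / len(unique)


rng = random.Random(0)  # fixed seed so the example is repeatable

# Made-up event log: every record carries its own timestamp and a user id.
log = b"".join(
    f"{1_300_000_000 + i},user{rng.randrange(10**9):09d},GET /page\n".encode()
    for i in range(50_000)
)

# Made-up VM estate: one random 1MiB base image cloned 20 times.
size = 1 << 20
base_image = rng.getrandbits(8 * size).to_bytes(size, "big")
images = base_image * 20

print(f"event log dedupe ratio: {dedupe_ratio(log):.1f}")
print(f"cloned images dedupe ratio: {dedupe_ratio(images):.1f}")
```

Every chunk of the log contains timestamps that appear nowhere else, so no two chunks match, while the cloned images collapse to a single copy. How much of a real repository resembles each case is exactly what the two sides dispute.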
Permabit: The real challenge the industry is looking to solve is how to efficiently store massive amounts of data, which only gets further compounded with Big Data. A typical enterprise, even if it is in the Big Data business, still has over half of its data created by business applications. This includes data such as user directories, Office documents, email and databases. And as you mention, virtualisation continues to gain market share across the enterprise. This data has very high deduplication rates.
If you're an organisation dealing with PBs of storage, and you have the "right" deduplication technology (that doesn't impact performance and can scale), how could you afford not to implement a technology that could save you 10X or more in storage costs?
Rob Peglar reviewed these points and replied: "While I'm happy to see claimed improvements in hashing, the fundamental fact remains: dedupe may save some ... space in certain scenarios (eg, virtual machine images) but 'Big Data' ... is often highly unique and can rarely be deduplicated.
"For example, web hits and traffic from end users. ... the billions of people with a mobile phone, one of the thousands of people who undergo an MRI scan every day, one of the many millions with a Facebook account or a credit card transaction for that day. Each end user is unique – by definition – and must be able to be identified as such to analytic software. Each hit is at a different point in time, always changing, always incrementing.
"If these hits are chunked up in 64KB blocks for dedupe analysis, hardly any will be duplicates. Because of the time-sensitive nature of Big Data – it's not just the content, it's the time at which the content was generated that counts – this data is very unlike what you mentioned, ie, home directories, Office-generated documents, and structured data. I am talking about unstructured, highly unique Big Data, the kind that is being generated by a wide variety of people every second of every day.
"So, for 'Big Data', deduplication is most often a bad trade-off – even if the CPU were infinitely fast, you wouldn't save much space on disk. Contrast this with traditional VM or VDI usage, where OS images are immutable and mostly read-only; here, deduplication is a good trade-off.
"But that is not the problem the industry is trying to solve now. The problem is ingestion, analysis and long-term storage and protection of massive (and growing!) amounts of unique and ever-changing data.
"Again, though, I'm happy to see claimed improvements in hashing technology; innovation is good."
There we have it: two differing points of view. Can they be brought together? ®