File system trees become inefficient and slow at locating files in a filespace holding billions of files and folders. Storing the data as objects in a flat storage space is an increasingly recommended alternative. But as soon as you adopt object storage to escape the tree-traversal problem, you face a fresh one: how do you locate your objects?
Either you have a central object map or database or you have a distributed one. French startup Scality has gone for the distributed approach with its Ring technology.
The idea is to have virtually unlimited scalability, both of I/O and storage capacity, by using clustered commodity x86 servers organised as peer-to-peer nodes – conceptually occupying a ring – with front-end Accessor software nodes receiving requests from users and applications on servers.
Scality CEO Jerome Lecat says an Accessor node can access any Ring node, note the "any", and find the right node storing a requested object in one network hop with 10 nodes, two hops with 100 nodes, and three hops with 1,000 nodes.
Holy Trinity in Scality's Ring technology: Accessors to the left, the Ring in the middle and secondary storage to the right.
A variety of Accessor node technologies are supported: native REST HTTP, NFS, BRS2 and Zimbra.
With each 10X increase in the Ring node count, the hop count goes up by just one, thanks to Scality's patented algorithm. We might call this a quite peculiar Ring cycle.
Lecat said: "There are really two 'tricks' here. [First] an algorithm delivering a maximum of Log(n) complexity – which basically gives, on a 100-node network: each node needs to know seven nodes, and a request may take seven hops. The minimum requirement from a mathematical standpoint is for each node to know a few other nodes. The number of nodes [to be known] increases as Log(number of nodes), which means that when the number of nodes is x10, you need to add 1 to the number of nodes to be known, or the number of hops.
"[Secondly] in practice, we allow nodes to know many more nodes, but this acts as a 'non authoritative cache', and it allows for a request to 'usually' converge in two hops, while keeping all the mathematical properties of the model (Log complexity, limited number of hops, good behaviour when a node is lost or added)."
Each node can handle 10 to 50TB of storage, with 1,000 nodes supporting up to 50PB of capacity, and accessing the right object in that 50PB with three hops on a gigabit LAN takes 20ms or less.
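Those numbers are easy to sanity-check. Here is a minimal sketch – my own arithmetic, not Scality code – assuming 50TB per node and one extra hop per tenfold increase in node count:

```python
import math

def ring_stats(node_count, tb_per_node=50):
    # Hop count grows with log10 of the node count (one extra hop
    # per tenfold increase); capacity grows linearly with nodes.
    # round() guards against floating-point noise in log10 before
    # taking the ceiling.
    hops = max(1, math.ceil(round(math.log10(node_count), 9)))
    capacity_pb = node_count * tb_per_node / 1000  # 1,000TB = 1PB
    return hops, capacity_pb

for n in (10, 100, 1000):
    hops, pb = ring_stats(n)
    print(f"{n:>5} nodes: {hops} hop(s), up to {pb:g}PB")
```

Running this reproduces the claimed figures: one hop and 0.5PB at 10 nodes, up to three hops and 50PB at 1,000 nodes.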
Distributed hash table
How does that work? Scality documentation says that the Ring nodes are organised into segments. Objects are stored with a Distributed Hash Table (DHT) algorithm, which produces a value for the object and its associated key. Key and value pairs are stored in the DHT and nodes retrieve the value associated with a particular key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes. Keys embed information about class of service, and each node is autonomous and responsible for consistency checking and rebuilding replicas automatically for its keys.
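By way of illustration only – Scality has not published its hashing scheme – a 20-byte key of the kind Lecat describes can be derived from an object name with SHA-1, whose digest happens to be exactly 20 bytes:

```python
import hashlib

def object_key(name: bytes) -> int:
    # SHA-1 yields a 20-byte digest, the same size as the 20-byte
    # key space Lecat describes; treat it as an integer position
    # on the ring. Illustrative only, not Scality's actual scheme.
    return int.from_bytes(hashlib.sha1(name).digest(), "big")

key = object_key(b"invoice-2011-03.pdf")
print(f"{key:040x}")  # 160-bit key, hex-encoded
```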
We can think of Scality's Ring nodes as dividing up a key space. This is organised into a hierarchy such that a 10-node ring requires one node-to-node hop to find the target node, a 100-node ring needs two hops and a 1,000-node monster needs three.
Lecat says: "The key space is distributed among all the nodes. The key space is very large (20 bytes), and distributed nearly evenly, but never exactly evenly. The underlying algorithm is a distributed hash table. The 'segments' do not have a constant size (as everything has to be dynamic in the system to allow real elasticity).
"Two key properties of the key space are that keys have an order, and they are organised into a circle (which gives trigonometric properties)."
Let's take a 10-node Ring as an example. An Accessor sends an object retrieval request to node 1, which doesn't hold the object. We're told the object can be retrieved with one hop – a jump from node 1 to the right node. Node 1 has enough information to forward the request to the node that holds the object, and so does every other node in the 10-node ring: that's how a distributed hash table works.
Scality doesn't say in detail how this works. I think it is a variation on this concept: each node has an ID, and the nodes are organised in a ring – a doubly linked list in which each node holds a reference to the previous node on the ring and its address, and to the next node and its address. Going round the ring, node IDs increase until you return to the starting node.
Okay? Keep that in mind and let's move on to the request receiving node, which gets the key from the Accessor request and hashes it to generate a key of exactly the same number of bits as the node reference. The system uses this as a node ID and goes round the ring node by node, looking for a node ID that is the closest possible to the key hash while still being larger. That node should store the desired object.
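A minimal sketch of that lookup rule, assuming SHA-1 for the hash and made-up node names (neither confirmed by Scality): sort the node IDs, then find the smallest ID greater than or equal to the key hash, wrapping round the circle if necessary.

```python
import bisect
import hashlib

def ring_hash(name: bytes) -> int:
    return int.from_bytes(hashlib.sha1(name).digest(), "big")

class ToyRing:
    def __init__(self, node_names):
        # Node IDs kept sorted, so the "closest ID while still
        # being larger" rule becomes a binary search rather than
        # a node-by-node walk.
        self.ids = sorted(ring_hash(n) for n in node_names)

    def successor(self, key: int) -> int:
        # Smallest node ID >= key; past the largest ID we wrap
        # back to the start of the circle.
        i = bisect.bisect_left(self.ids, key)
        return self.ids[i % len(self.ids)]

ring = ToyRing([b"node-%d" % i for i in range(10)])
owner = ring.successor(ring_hash(b"some-object"))
```

In a real DHT each node only knows a handful of peers, so the search is forwarded hop by hop instead of resolved in one local binary search – but the ownership rule is the same.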
A reversal of this is used to store incoming objects on the Ring and ensure they are locatable.
Lecat said: "If a node is lost, the ring rebalances itself without human intervention. [It's the] same if a ring node is added (human intervention needed to decide to add a node), the new node is automatically placed well in the key space, and rebalances only occur when necessary and automatically."
To understand any more than this requires a computer science skill set and access to the Scality Ring designers.
Reliability, autonomy and replicas
Scality says every node constantly monitors a number of its peers, presumably the ones in its segment, and automatically rebalances replication and load among them, making the system self-healing if a node crashes. The same mechanism copes with nodes joining the ring as it grows.
When an object is first loaded, the Accessor node involved may assign a storage class to it which can define the number of disk failures it can survive: one, two or more. That implies that for it to survive one disk failure, there need to be two copies and so on. The Ring manages the number of replicas requested for each object.
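The arithmetic is simple – surviving N disk failures needs N+1 copies – and a common way to place those copies in a ring design (a hypothetical sketch here, not confirmed as Scality's method) is on the distinct nodes that follow the key round the circle:

```python
import bisect

def replica_nodes(sorted_node_ids, key, failures_to_survive):
    # N failures to survive means N+1 copies; place them on the
    # N+1 consecutive nodes at or after the key on the ring,
    # wrapping round past the largest ID.
    copies = failures_to_survive + 1
    i = bisect.bisect_left(sorted_node_ids, key)
    return [sorted_node_ids[(i + k) % len(sorted_node_ids)]
            for k in range(copies)]

# Surviving one disk failure => two copies, on two distinct nodes.
print(replica_nodes([10, 20, 30, 40], 15, 1))  # [20, 30]
```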
The Ring also produces reliable performance, roughly equivalent to a storage area network, although it would be interesting to see what happens with a 10,000-node ring (four hops) and a 100,000-node one (five hops).
Lecat said: "The number of nodes is theoretically infinite since complexity increases less than the number of nodes. We are very confident about 1,000 nodes, but we have not been able to access enough hardware yet to actually test 10,000."
The Ring dumps low-activity data off to a second tier of storage, where it is compressed and deduplicated.
Customers and funding
Cloud service providers use Scality's Ring to offer storage-as-a-service applications, such as email. These customers are European, Scality not being a Silicon Valley-based startup, and include Belgian broadband cable service provider Telenet; Host Europe, which offers cloud hosting; German cloud computing providers ScaleUp and Dunkel; and German web hoster intergenia.
Revenues from these companies explain the relatively low funding level of the June 2010 A-round, in which Scality received $5m from three French venture capital firms. This followed seed financing of $1.3m in February, when the firm started life by morphing out of Bizanga, which had sold its Mail Transfer Agent product to Cloudmark and was looking for a new business to enter. Much of that seed finance came from Scality's employees.
Scality has partnered with CommVault: CommVault Simpana can now be provided in the country of operation by local cloud storage providers using the Scality Ring platform, or deployed on-site by customers wanting private clouds built on the same technology. That gets over the legal and social need for cloud-held data to be stored in the country of origin.
So what do we have here? Scality is an unusual creation: a France-based startup with up-to-the-minute object storage technology that is actually in use by service providers, billing customers real money for using Scality's Ring. They could have chosen Caringo, EMC's Atmos or Centera, CleverSafe, or HDS's HCP, but they went for Scality instead.
Maybe you should check out Scality's Ring technology too. ®