Cleversafe and Hadoop
Cleversafe is announcing a platform integration with Hadoop as part of its upcoming Cleversafe 3.0 release, scheduled for later this year. We rarely discuss product announcements before the product is in general release. However, Cleversafe has a proven track record with object-based distributed storage, and by combining dispersed storage with computation, the new announcement should interest the fast-growing pool of Hadoop users.
In a ridiculously small nutshell, Hadoop enables deep business analytics on large volumes of semi-structured and unstructured data dispersed across multiple servers. Unlike a centralized database cluster sharing processors and storage, a Hadoop cluster consists of otherwise independent servers acting in parallel. Hadoop's MapReduce engine spreads the computation across the servers: each server processes only its own portion of the data, and Hadoop then unifies the results and presents the output to users. The beauty of this is that a Hadoop cluster can be very large and powerful while remaining scalable and highly economical on commodity servers.
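To make the map-then-reduce division of labor concrete, here is a minimal sketch in plain Python. The partitions, record strings, and function names are all hypothetical stand-ins for data spread across servers; real Hadoop jobs are typically written in Java against the MapReduce API.

```python
from collections import defaultdict

# Hypothetical log records spread across three "servers" (partitions).
partitions = [
    ["error disk full", "error timeout"],
    ["info started", "error timeout"],
    ["info stopped"],
]

def map_phase(records):
    """Each server emits (key, 1) pairs for its own slice of the data."""
    return [(word, 1) for line in records for word in line.split()]

def reduce_phase(pairs):
    """The framework groups pairs by key and sums the counts."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# The map step runs independently on each partition; the framework
# merges the intermediate pairs before reducing.
mapped = [pair for part in partitions for pair in map_phase(part)]
counts = reduce_phase(mapped)
print(counts["error"])  # 3
```

Each server only ever touches its own partition in the map step, which is what lets the cluster scale by simply adding commodity nodes.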
But nothing is perfect. Hadoop protects data by storing three copies of every block in case a server fails, but with data reaching petabyte and exabyte levels this replication becomes very cumbersome. There is also the issue that the Hadoop Distributed File System (HDFS) relies on a single metadata server, the NameNode, creating a single point of failure.
At this point Cleversafe steps in, replacing HDFS with a platform that combines Hadoop MapReduce with Cleversafe's Dispersed Storage Network (dsNet) system. (dsNet maintains the HDFS interface to MapReduce.) dsNet is an object-based dispersed storage system that stores metadata and content on dsNet Slicestor appliances acting as storage nodes. In the Hadoop platform, Cleversafe disperses the data across the Slicestors, and Hadoop performs its computations directly on the Slicestor nodes instead of on a separate Hadoop server cluster.
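The idea behind dispersed storage is that data is split into slices such that only a subset of them is needed to reconstruct the original. As a toy illustration (not Cleversafe's actual dispersal algorithm, which uses a more general information dispersal scheme), here is a simple 2-of-3 code: two data slices plus an XOR parity slice, where any two of the three recover the data.

```python
def disperse(data: bytes):
    """Split data into two halves plus an XOR parity slice (a toy 2-of-3 code)."""
    if len(data) % 2:
        data += b"\x00"  # pad to an even length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def reconstruct(slices):
    """Recover the original from any two of the three slices (None = lost)."""
    a, b, p = slices
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, p))
    if b is None:
        b = bytes(x ^ y for x, y in zip(a, p))
    return (a + b).rstrip(b"\x00")

slices = disperse(b"hadoop")
slices[0] = None             # simulate losing one storage node
print(reconstruct(slices))   # b'hadoop'
```

Production dispersal codes use wider configurations (many slices, tolerating several simultaneous node losses), but the fault-tolerance principle is the same: no single node holds a complete or indispensable copy of the data.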
There is no single point of failure on the data or metadata side, and no need to keep three copies of the Hadoop data. Brand new protocols for the combined MapReduce/dsNet include SliceStream, where dispersed raw data is stored in contiguous chunks that MapReduce can read directly, without first being rebuilt by Cleversafe's reconstruction algorithm. Cleversafe is compatible with existing Hadoop deployments, both open source and commercialized.
Note that this is not the cheapest way to go by any means. Hadoop is highly economical because its users can deploy commodity servers to analyze even very large amounts of data. The consideration then for using Cleversafe’s Hadoop solution in the upcoming 3.0 release is two-fold: 1) Is it too great a risk for you to have a metadata server as a single point of failure? 2) Has your data increased to petabyte-plus scales that make Hadoop replication too cumbersome?
If you answer "Yes" to either of these questions, then it will prove worthwhile to pay the extra for Cleversafe. The additional cost will be partially offset by the savings you'll accrue by eliminating HDFS's 3x replication, with all of its associated storage space and power requirements. And with Cleversafe, you'll have a more scalable and reliable solution to your big data needs going forward. Certainly this is the way a large part of the enterprise market will be going, thanks to relentless data growth and businesses' need to make sense of it all so they can increase profitability and competitiveness.
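The scale of that replication saving is easy to sketch. The arithmetic below is illustrative only: the 10-of-16 dispersal configuration is an assumed example, not a published dsNet figure, and real expansion factors depend on the chosen width and threshold.

```python
# Illustrative raw-capacity arithmetic for 10 PB of usable data.
usable_pb = 10

# HDFS default: three full copies of every block.
replication_factor = 3
hdfs_raw_pb = usable_pb * replication_factor

# Hypothetical dispersal config: 16 slices, any 10 reconstruct the data,
# giving an expansion factor of 16/10 = 1.6x instead of 3x.
width, threshold = 16, 10
dispersal_raw_pb = usable_pb * width / threshold

print(hdfs_raw_pb)       # 30
print(dispersal_raw_pb)  # 16.0
```

Under these assumptions, dispersal needs roughly half the raw disk of triple replication at the same usable capacity, while still tolerating multiple simultaneous node failures.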