VMware Adds Hadoop into vSphere with Big Data Extensions

VMware announced vSphere Big Data Extensions this week. At first it might seem to be just a productization of some open source Hadoop deployment software, but dig in a bit and you can see that the future of Hadoop may well be virtual hosting, a big shift from its intentional commodity-server roots. It also puts VMware squarely on top of two data center workload trends: scale-out computing applications and offering everything "as a service".

New vSphere Big Data Extensions

VMware has been exploring how best to support Hadoop (and, by implication, future scale-out computing applications) since rolling out its open source Project Serengeti in June 2012. With Serengeti, virtual admins could quickly spin arbitrary Hadoop clusters up and down within the virtual environment. That proved popular despite some concern that virtual hosting might be inefficient and costly for Hadoop, which was originally designed to exploit MapReduce-style distributed processing on a scale-out cluster of commodity servers.

To address that concern, VMware has since been working hard to show that an optimized deployment of Hadoop in the virtual environment can not only provide comparable performance and efficient use of resources, but also deliver other enterprise (and service provider) friendly benefits like ease of management, elasticity, and effective multi-tenant/mixed-workload use of infrastructure.

Now, with the announced beta of "vSphere Big Data Extensions" (BDE), VMware is providing built-in, optimized support and operational management for Hadoop in vSphere and vCenter. Virtualized Hadoop clusters can be actively managed through a virtual-admin-friendly BDE GUI, and BDE comes with internal algorithms to monitor and manage specified QoS levels between Hadoop clusters (e.g., a production cluster vs. development/test clusters). BDE can enforce automated "elasticity" actions, like spinning nodes up and down in each cluster, to maintain balance (and in the process optimize utilization over time). BDE also helps steer around and quickly recover from disk failures, something that happens more and more often as data sets go "big".
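To make the elasticity idea concrete, here is a minimal conceptual sketch of a QoS-driven balancer that shrinks lower-priority clusters when a higher-priority one is starved. This is not VMware's actual BDE algorithm; the cluster names, priorities, and threshold are invented for illustration.

```python
# Conceptual sketch of QoS-driven elasticity between virtual Hadoop clusters.
# NOT VMware's BDE algorithm; names, priorities, and thresholds are hypothetical.

from dataclasses import dataclass

@dataclass
class VirtualCluster:
    name: str
    priority: int       # higher number = higher priority
    active_nodes: int
    min_nodes: int      # floor below which a cluster is never shrunk
    cpu_demand: float   # 0.0-1.0 fraction of entitlement currently wanted

def rebalance(clusters, starved_threshold=0.9):
    """Move one node's worth of capacity from the lowest-priority cluster
    to each higher-priority cluster that is resource-starved."""
    starved = [c for c in clusters if c.cpu_demand > starved_threshold]
    for hungry in sorted(starved, key=lambda c: -c.priority):
        donors = [c for c in clusters
                  if c.priority < hungry.priority and c.active_nodes > c.min_nodes]
        for donor in sorted(donors, key=lambda c: c.priority):
            donor.active_nodes -= 1   # power off one task-node VM
            hungry.active_nodes += 1  # power on one task-node VM
            break

prod = VirtualCluster("production", priority=10, active_nodes=8, min_nodes=4, cpu_demand=0.95)
dev  = VirtualCluster("dev-test",   priority=1,  active_nodes=6, min_nodes=2, cpu_demand=0.30)
rebalance([prod, dev])
print(prod.active_nodes, dev.active_nodes)  # -> 9 5
```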

How Does Big Data Virtualize?

Technically, Hadoop cluster nodes run as VMs. Multiple VM "nodes" can run on each hypervisor server, which enables optimizing infrastructure utilization, growing the cluster elastically and dynamically, and leveraging vSphere's HA/FT features. One interesting option is to run compute nodes and data storage nodes as separate VMs to support orthogonal scaling and optimal use of each resource, as sketched below. (Another option is to leverage SAN/NAS storage, which would be the subject of a much longer conversation than we can fit here in a blog post.)
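As a rough illustration of that separated layout, a cluster might be declared with independent node groups so each can scale on its own axis. This is a hypothetical spec; Serengeti's actual cluster definition format may differ, and the field names here are invented for illustration.

```python
# Hypothetical cluster definition illustrating compute/data separation.
# Field names are invented; this is not necessarily Serengeti's real schema.

cluster_spec = {
    "name": "demo-hadoop",
    "node_groups": [
        {   # data nodes: sized and scaled for storage capacity
            "name": "data",
            "roles": ["hadoop_datanode"],
            "instance_count": 4,
            "storage_gb": 500,
        },
        {   # compute nodes: stateless, so they can be powered on and off
            # elastically without moving any HDFS data
            "name": "compute",
            "roles": ["hadoop_tasktracker"],
            "instance_count": 8,
            "storage_gb": 20,
        },
    ],
}
```

Because the compute node group holds no HDFS data, elasticity actions can power compute VMs on and off freely while the data VMs stay put.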

The trick to making this work is highly automated management. But even then there is the initial concern about performance, since a virtual Hadoop deployment seems to "break" the direct commodity-server-compute-with-DAS model.

As to performance, the hypervisor itself doesn't intrude on its clients' performance in any noticeable way. The main performance concerns are:

1. Getting an optimal number and mix of Hadoop node VMs assigned to each physical hypervisor server. This is highly dependent on the specific Hadoop applications and data sets over time. Obviously VMs share, and therefore compete for, hypervisor resources, but sharing pooled resources is also an opportunity to extract more mileage out of the infrastructure.

Not only can virtual admins dynamically create, destroy, shrink, and enlarge virtual clusters, but BDE's QoS algorithms ensure that prioritized Hadoop clusters get resources ahead of other clusters sharing the infrastructure, even to the point of dynamically reducing the active node counts of lower-priority clusters (much as in the elasticity sketch above).

2. Ensuring locality of data to the compute jobs. In a physical Hadoop cluster, HDFS chunks data out over the cluster, while compute jobs are mapped out "local" to each data chunk (if not on the same node, then in the same "rack" if possible). In a virtually hosted cluster, additional management is needed to ensure data locality. For example, it's not desirable to simply let vMotion move Hadoop VMs around based on high-water compute thresholds, nor is Storage vMotion going to be very useful on big data sets. On the other hand, these facilities can be readily leveraged to offer higher availability and fault tolerance than physical Hadoop clusters can.

To address this, VMware has contributed Hadoop Virtual Extensions (HVE) into Apache Hadoop (1.2), which help Hadoop nodes become "data locality" aware in a virtual hosting environment. Data locality knowledge is important to keep compute tasks close to their required data. Native Hadoop knows about data locality down to the node and rack level; with the extensions, Hadoop becomes more "virtualization aware" through a concept of "node groups" that basically correspond to the set of virtual Hadoop nodes running on each physical hypervisor server. The node group level effectively enables Hadoop to reason about data locality in a virtual hosting environment where local storage is being leveraged.
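The extended locality hierarchy can be sketched as follows: prefer data on the same VM, then on another VM in the same node group (the same hypervisor host), then in the same rack, and only then off-rack. This is a conceptual illustration of the idea, not Hadoop's or HVE's actual scheduling code, and all the names in it are invented.

```python
# Conceptual sketch of HVE-style locality-aware replica selection.
# Hierarchy: node-local < nodegroup-local < rack-local < off-rack.
# Illustration only; not Hadoop's actual scheduler code.

from dataclasses import dataclass

@dataclass(frozen=True)
class Location:
    rack: str
    node_group: str   # physical hypervisor host
    node: str         # individual VM

def locality_cost(task_node: Location, replica: Location) -> int:
    """Lower is better: 0 node-local, 1 nodegroup-local, 2 rack-local, 3 off-rack."""
    if task_node == replica:
        return 0
    if (task_node.rack, task_node.node_group) == (replica.rack, replica.node_group):
        return 1
    if task_node.rack == replica.rack:
        return 2
    return 3

def best_replica(task_node: Location, replicas: list[Location]) -> Location:
    return min(replicas, key=lambda r: locality_cost(task_node, r))

vm = Location("rack1", "host-a", "vm3")
replicas = [Location("rack1", "host-a", "vm1"),   # VM on the same hypervisor
            Location("rack2", "host-c", "vm9")]   # off-rack copy
print(best_replica(vm, replicas))  # picks the nodegroup-local copy on host-a
```

Without the node group layer, a scheduler would treat two VMs on the same hypervisor as no closer than two arbitrary nodes in the same rack, missing the cheapest data path.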

Big Bottom Line

Hadoop is generalizing (e.g., YARN) into a more general-purpose scale-out computing platform. The combination of dead-simple Hadoop deployment and operations in a standard virtualized environment, with no loss of performance, potential cost benefits, and cloud-oriented services, will greatly accelerate the adoption of scale-out computing solutions in enterprise data centers. On-premises Hadoop will become easily available to everyone with some spare capacity in their VMware cluster. And if you have to expand your VMware cluster a bit to play with Hadoop, it's hardly a risky bet, since the resources can be reallocated to any other virtualized workload.

We saw VMware give up Cetas to Pivotal, and that now makes sense: VMware looks more like the app-agnostic, wall-to-wall data center platform. BDE today supports Cloudera, Hortonworks, MapR, and Pivotal HD, and will no doubt extend agnostically to other distros.

VMware's BDE (along with Project Serengeti and HVE) now offers IT a compelling "private" cloud big data service opportunity. I also imagine that public clouds running vSphere will be able to use this to compete with AWS EMR.

VMware is aiming to make vSphere not only the platform for critical business workloads, but also the scale-out application platform of the next-generation data center. By supporting scale-out computing, vSphere might claim to be the only platform enterprises need for all workloads, current and future.

  • Premiered: 06/28/13
  • Author: Mike Matchett
  • Topic(s): Big Data, Virtualization, VMware
