A Shift To Enterprise Features For Big Data Solutions: Notes from Strata/Hadoop World NY 2014

I had a blast last week at Strata/Hadoop World NY 2014. I got a real sense that the mass of big data sponsors and vendors are finally focusing on what it takes to get big data solutions into production operations. In fact, one of the early keynotes noted that the majority of attendees were software engineers doing implementation work, not necessarily analytical data scientists. Certainly there was no shortage of high-profile use cases bandied about and impressive sessions on advanced data science, but on the show floor much of the talk was about making big data work in real-world data centers.

I’ll certainly be diving into many of these topics more deeply, but here is a not-so-brief roundup of major themes culled from the 20+ sponsors I met with at the show:


  • Hadoop Vendors Expand Focus - The “distributions” entrench and expand. Cloudera adds more orchestration and provisioning to Cloudera Director, with new partnerships with Red Hat, EMC Isilon, and even Teradata. Note that most non-distro players will likely end up with partnerships with most of the distros, so we take some of this with a grain of salt. For example, Hortonworks already has a lot of the same partnerships and more - MS, HP, etc., all hoping to get their agendas supported in base core Hadoop. 
    MapR has moved its high-performance Mbase over to its free community edition in order to better foster earlier development adoption. We aren’t surprised that MapR claims to be attracting many of its customers from folks running into production constraints with other distros. 
    Pivotal is making progress on distilling out a cohesive “3rd platform” application solution for the biggest industrial challenges, with performance differentiation at scale from GemFire, GemXD and HAWQ, and at the same time better open source and ecosystem contribution and adoption, with Spark, Tachyon and Ambari coming.
    Let’s not forget IBM BigInsights, which although way under-marketed, presents a good combination of open source-ness with some IBM-specific additions like using GPFS instead of HDFS.
  • Storage is Still Key - Speaking of storage, it’s clear to most operational big data folks that the key to success is still in getting the storage layer right. Hadoop is really analytics converged with scale-out storage. We see a lot more options formally supported for external storage or alternatives to native HDFS - several just mentioned above plus more to come below. The takeaway is that native HDFS is really software defined storage for big data. And (like OpenStack Swift) if HDFS is viewed just as an API/protocol it can be provided by many means. If you invert your perspective, analytics can be pushed down into the storage paradigms as a data service (like in-memory grids). As important as the analytics are to big data, keep an eye on how to best implement storage to get the enterprise features you need.
  • Virtualized Hadoop - Which brings up virtualized Hadoop hosting. VMware BDE and Project Sahara (OpenStack) have been offering a way to virtually host compute and storage services, but these still feel like works in progress. Now along comes BlueData, winning the award for best solution in the Strata startup showcase. They virtualize the compute side of big data to support instant compute cluster provisioning, while providing a performance-enhanced I/O system that works with your existing enterprise data and storage, so that you can analyze your “big data” where it already exists - ostensibly already in an enterprise-quality SAN or NAS. No data migration, movement, or added data management headaches.
  • SQL on Hadoop - Another theme was about bringing transactional SQL to Hadoop. It’s clear that folks want to use existing skills and analytical processes and tools, and with some hard work SQL is coming to Hadoop in force. Both HP-sponsored open-source Trafodion and Splice Machine are coming out of the shadows with straight-on, OLTP-friendly SQL solutions for big data. MapR has always thought operational-speed, fine-grained transaction support was important. It seems a new era here, and naysayers who want to keep Hadoop pigeonholed in analytics only might find themselves left behind. (NoSQL was also big at the show, with a lot of MongoDB-like capabilities being discussed.)
  • Hardware Becomes Important Again - Who says commodity architecture provides a better TCO? For out-and-out performance we see in-memory solutions that can greatly accelerate analytics, like GridGain, whom we recently profiled in a full Taneja Group Report. 
    Dell offers a monster In-Memory Appliance for Cloudera Spark with 384GB memory. Cluster just 20 of these ready-to-go converged systems for 7.5 TB of memory.
    HP gets full end-to-end convergence going with their HAVEn stack, Hortonworks investment, and their SL4550 optimized for storage with 60 disks per 4U, with an ability to leverage the unique Moonshot cartridges for compute intensity.
    SDN also rears up in this ecosystem via Plexxi to help dynamically cluster physically disparate resources as if they were all locally attached. 
  • Hadoop Operations and Management - The Hadoop cluster is looking more like a datacenter these days with better security options from vendors like Dataguise and Zettaset, large-scale data inventory solutions from the likes of Waterline Data (solving problems you probably don't know you'll have until you build a data lake/hub), and my favorite system management topic - performance and capacity optimization from Pepperdata. Pepperdata not only visualizes metrics on both the JVM and O/S sides of the equation, but dynamically (3-5 seconds) optimizes running jobs to ensure QoS and full resource utilization. What’s not to like about maxing performance and meeting SLAs on optimally sized architectures?
  • Oh and Analytics Too! - Yes, there were of course plenty of interesting analytical solutions on display, including new identity analytics from Novetta, end-to-end modeling workflow support from Predixion, lots of in-memory Spark talk, high-speed event data analysis from Interana and more great data integration/workflow/visualization from the Pentaho team. Someday soon I hope to do industry analyst justice to the analytics side of this whole Hadoop ecosystem.
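
For anyone sizing a cluster, the Dell memory figure in the hardware bullet above is easy to sanity-check. Here is a minimal Python sketch of that arithmetic; the binary-unit convention (1 TB = 1024 GB) and the helper name `cluster_memory_tb` are assumptions made for illustration, not anything taken from Dell's specifications:

```python
# Back-of-the-envelope check of the aggregate memory figure quoted above.
# Assumption: the "7.5 TB" figure uses binary units (1 TB = 1024 GB).

GB_PER_APPLIANCE = 384   # memory per appliance, as quoted in the post
APPLIANCES = 20          # cluster size, as quoted in the post
GB_PER_TB = 1024         # assumed binary convention


def cluster_memory_tb(appliances: int, gb_each: int, gb_per_tb: int = GB_PER_TB) -> float:
    """Return total memory in TB for a cluster of identical appliances."""
    return appliances * gb_each / gb_per_tb


total_gb = APPLIANCES * GB_PER_APPLIANCE
total_tb = cluster_memory_tb(APPLIANCES, GB_PER_APPLIANCE)

print(f"{total_gb} GB total = {total_tb} TB")
```

With decimal units (passing `gb_per_tb=1000`) the same cluster works out to 7.68 TB, so the quoted 7.5 TB figure is consistent with binary units.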

Overall this was a thoroughly enjoyable show. I wish I had another day or two. There were just too many interesting vendors to even say “hi” to everyone. O’Reilly/Cloudera had moved it to the Javits Center this time, and it still sold out with over 5,500 attendees. As the Hadoop ecosystem expands to include more data center data, storage, and processing, expect this show to grow bigger and evolve too. 

Let me know if you were at the show and saw a trend that should be on my “short” list above!

  • Premiered: 10/22/14
  • Author: Mike Matchett
Topic(s): Big Data strata Hadoop Virtualization Management Storage

