Taking on some of big data’s biggest challenges
by Roberto Zicari
If you’re a practitioner of big data, you probably know that being a big data expert is more than just knowing HDFS and a little coding.
What’s most interesting is that the explosion in data volumes is moving us toward a completely different set of technologies that will form the fabric of computing in the years to come. The race to develop a more robust big data platform is shaking up the status quo, and the very nature of how we deliver IT solutions will change. The industry is starting to develop the building blocks of a big data utopia, based on the biggest challenges that practitioners face when delivering big data insights, and it’s very exciting.
Working at HP, I find it exciting to see how the company is taking on some of big data’s biggest challenges while racing against its competitors. The battle for bigger big data is being waged on many fronts.
Front One: Big Pipes
The way we connect computers together has evolved from hubs to switches and beyond. Yet much of what we do with big data is still limited by network traffic. Nodes in a Hadoop cluster have to talk to each other, and they shouldn’t be bogged down by the meaningless chatter that often passes between nodes. Yesterday’s networks are chatty, drawing nodes into traffic that could be avoided. It is becoming clear that network traffic needs a fresh look.
Companies like HP are developing Software Defined Networks (SDN) that are smarter about how data is transmitted across the network. Rather than the old method of oddly bouncing around, data can take a more direct path to its destination, making for a much more efficient network. Networks can be virtualized, giving more network power to critical workloads.
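To make the “more direct path” idea concrete, here is a toy sketch — not HP’s SDN implementation, and the topology and link costs are invented — of a controller picking the cheapest route through a small network with Dijkstra’s algorithm:

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra's algorithm: return (cost, path) of the cheapest route.

    graph maps each node to a dict of {neighbor: link_cost}.
    """
    queue = [(0, src, [src])]  # (cost so far, current node, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None

# Invented topology: bouncing through the hub costs 10; a direct link costs 3.
topology = {
    "rack-a": {"hub": 5, "rack-b": 3},
    "hub":    {"rack-a": 5, "rack-b": 5},
    "rack-b": {"rack-a": 3, "hub": 5},
}
print(shortest_path(topology, "rack-a", "rack-b"))  # (3, ['rack-a', 'rack-b'])
```

A controller with a global view of the fabric can program this kind of direct route into the switches, instead of letting traffic bounce hop by hop.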
This front is perhaps where the fiercest battling is taking place, as network equipment manufacturers choose sides between open standards and proprietary technology. The battle is on to provide the best user experience for managing the new pipes.
Front Two: Data Storage and Compute Resource Managers
We can keep throwing more computing racks at the big data problem, but you often don’t need the same amount of compute resources as you do storage resources. Cookie-cutter servers don’t make the most efficient use of compute: when you’re trying to solve a big data problem, you may find your CPU utilization pinned while your storage sits at 25% utilization.
To remedy this, vendors need to embrace the fact that computing power is one resource and storage is another, and that the two should be virtualized, or separated, to deal effectively with big data. By separating the compute layer from the storage layer, you can achieve significantly better efficiency by dedicating the right number of CPUs or storage devices to the workload.
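A back-of-envelope illustration of why this matters — the workload and server sizes below are invented, not real HP configurations:

```python
import math

# Hypothetical workload: needs 400 cores of compute but only 200 TB of storage.
cores_needed, tb_needed = 400, 200

# Cookie-cutter node: 16 cores and 48 TB, bought as an inseparable unit.
# You must buy enough nodes to satisfy the *larger* of the two demands.
combined_nodes = max(math.ceil(cores_needed / 16), math.ceil(tb_needed / 48))
idle_tb = combined_nodes * 48 - tb_needed  # storage sits idle while CPUs are pinned

# Disaggregated: size each layer to its own demand.
compute_nodes = math.ceil(cores_needed / 16)  # e.g. low-power compute cartridges
storage_nodes = math.ceil(tb_needed / 48)     # high-capacity storage servers

print(combined_nodes, idle_tb)        # 25 nodes bought, 1000 TB idle
print(compute_nodes, storage_nodes)   # 25 compute nodes + just 5 storage nodes
```

In the cookie-cutter case the CPU demand forces you to buy 25 full nodes, stranding 1,000 TB of disk; sizing the layers independently buys the same compute with a fraction of the storage hardware.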
HP’s enterprise group is working on this battle front. Working with partners in the Hadoop space, it is able to virtualize the compute layer and storage layer to fit the Hadoop workload: you can install more low-cost, low-power Moonshot nodes to boost compute, or more high-capacity storage nodes if you need them. Big Data Reference Architectures are available for this purpose.
Front Three: Polyglot Persistence
In the enterprise today, data sits in Hadoop, on on-premises storage and in the cloud. It is available unstructured and structured, optimized for analytics and raw. Yet it’s a chore to have to move it, structure it and otherwise munge it. It shouldn’t matter where the data sits or what format it’s in; you should be able to perform analytics on it regardless of these factors. A polyglot is “someone who speaks or writes several languages”. For Hadoop, you have to speak ORC, Parquet, Avro and other popular formats, and you have to speak SQL, Python, R and C++ in your analysis of the data.
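The polyglot idea can be sketched as a dispatch table of format readers. The sketch below uses stdlib CSV and JSON as stand-ins; a real engine would register ORC, Parquet and Avro readers (via third-party libraries) in exactly the same slot, so callers never care which format they hit:

```python
import csv
import io
import json

# Registry of format readers keyed by file extension. A real polyglot engine
# would add "orc", "parquet" and "avro" entries here; the interface is the same.
READERS = {
    "csv":  lambda text: list(csv.DictReader(io.StringIO(text))),
    "json": lambda text: json.loads(text),
}

def read_records(name, text):
    """Dispatch on the file extension so callers never care about the format."""
    ext = name.rsplit(".", 1)[-1]
    return READERS[ext](text)

# The same logical records, stored two different ways, come back identical.
expected = [{"city": "Palo Alto", "pop": "67000"}]
assert read_records("a.csv", "city,pop\nPalo Alto,67000") == expected
assert read_records("b.json", '[{"city": "Palo Alto", "pop": "67000"}]') == expected
```

The point is that the format knowledge lives in one registry, not scattered through every analysis script.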
At HP, the Vertica team has just announced the capability to do this by offering HP Vertica for SQL on Hadoop. This new offering lets you leverage the same analytical engine just about anywhere: the engine speaks the language of popular analytical applications and accesses the data in popular file formats. The battle on this front rests on the maturity of the Vertica engine and how well it can act as a polyglot.
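Vertica’s engine is proprietary, but the “one SQL dialect over many sources” idea can be sketched with stdlib sqlite3. In this toy version, rows arrive as CSV and as JSON yet a single engine answers one SQL query over both — note that this sketch copies the data into the engine, which is exactly the step a real SQL-on-Hadoop engine avoids by reading the files in place:

```python
import csv
import io
import json
import sqlite3

# One engine, one schema, regardless of how the rows were originally stored.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (source TEXT, clicks INTEGER)")

# Source 1: CSV-shaped data.
for row in csv.DictReader(io.StringIO("source,clicks\nweb,10\nweb,5")):
    db.execute("INSERT INTO events VALUES (?, ?)", (row["source"], int(row["clicks"])))

# Source 2: JSON-shaped data.
for row in json.loads('[{"source": "mobile", "clicks": 7}]'):
    db.execute("INSERT INTO events VALUES (?, ?)", (row["source"], row["clicks"]))

# One SQL dialect answers questions over both sources at once.
totals = list(db.execute(
    "SELECT source, SUM(clicks) FROM events GROUP BY source ORDER BY source"))
print(totals)  # [('mobile', 7), ('web', 15)]
```

The analyst writes one query; where each row came from, and in what format, is the engine’s problem.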
There are many other battlefronts in the war for big data. For example, something as mundane as backup has to be reworked in the world of big data. Where does the big data in my data lake go? How does it get there without bogging down servers and the network? How can I restore only the parts of the backup I need?
Companies are looking at things like concurrency and democratization of information to information consumers, data security and encryption, operational and real-time analytics and so on. Perhaps I’ll look at some more areas in an upcoming post. It’s great to see that HP is striving to succeed in all of these areas.
While these technologies are emerging, the vision for our future might be an integrated system that seamlessly puts together all of these factors. Think about how powerful it would be to have a notepad that could seamlessly use all of your compute and storage resources over a direct and very fast pipe. The operating system wouldn’t necessarily care where the data was stored or what format it was in, and could access your compute resources anywhere. This data utopia is probably not in our near future, but perhaps someday… and that’s when the fun begins.
Sponsored by Hewlett Packard Enterprise.