Q&A with Matt Schumpert, Datameer

Matt Schumpert, Director of Product Management, Datameer

Q1. You recently announced a partnership with Tableau. What is the business value of the Datameer and Tableau Server connector?

There’s a huge number of Tableau users already leveraging Tableau’s best-in-class visualization tools who also want access to big data and need a Hadoop-native analytics platform to ingest, prepare, analyze, and aggregate multi-structured data prior to visualization. Increasingly, we saw our own customers using the products together in complementary ways, and we simply wanted to tighten that integration point to make it seamless across Tableau’s product suite. With a few clicks, users can now publish any data set at any stage of a Datameer analytics pipeline directly into Tableau Server or Tableau Desktop, or to a centralized file sharing location like FTP, Amazon S3, or HDFS in Tableau’s native file format.
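
Datameer’s connector handles this publishing step inside the product; purely as an illustration of the general handoff pattern (write an extract file, then push it to a shared location such as Amazon S3), here is a minimal sketch using boto3. The bucket, key, and file names are hypothetical, and AWS credentials are assumed to be configured already.

```python
# Illustrative sketch only -- not Datameer's connector. Shows the generic
# pattern of publishing a prepared extract file to a shared S3 location
# where a BI tool can pick it up. Names are hypothetical.
import boto3


def publish_extract(local_path: str, bucket: str, key: str) -> None:
    """Upload a prepared extract file to S3 for downstream visualization tools."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)


publish_extract("daily_aggregates.tde", "analytics-extracts", "tableau/daily_aggregates.tde")
```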

Q2. You have announced new data governance capabilities for native Hadoop environments. What are they?

When it comes to governance, Datameer has long built technology ahead of the Hadoop ecosystem. But as Hadoop deployments have grown more complex, there’s a need both for richer functionality and for integration between platforms and applications. To our existing capabilities of role-based security; Kerberos, LDAP, and Active Directory authentication; SAML-based SSO; field-level security policies; and data masking, we’ve added visual data lineage tools across the entire pipeline (from ingest to export), a new SDK for publishing events to an audit trail and versioning together with source control systems, and visual audit reports to help administrators track down how data has been shared (or potentially leaked).
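
To make the field-level masking idea concrete, here is a minimal sketch of the concept in pandas; this is not Datameer’s implementation, and the column names are hypothetical. Emails are pseudonymized with a stable one-way hash (so joins on the masked value still work), while SSNs are redacted except for the last four characters.

```python
# A minimal sketch of field-level data masking as a concept, not
# Datameer's implementation. Column names are hypothetical.
import hashlib

import pandas as pd


def mask_email(value: str) -> str:
    """Replace an email with a stable one-way hash so joins still work."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


df = pd.DataFrame({
    "customer_id": [1, 2],
    "email": ["alice@example.com", "bob@example.com"],
    "ssn": ["123-45-6789", "987-65-4321"],
})

df["email"] = df["email"].map(mask_email)                            # pseudonymize
df["ssn"] = df["ssn"].str.replace(r".(?=.{4})", "*", regex=True)     # keep last 4 only
print(df)
```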

Q3. Why does the world of big data need to take data governance more seriously in order to become ready for enterprise-grade deployments?

First, because many organizations need that level of increased governance to meet regulatory compliance requirements, and second, because they’re used to that level of governance in their EDW, the platform Hadoop often takes the place of for new workloads. Finally, in contrast to EDWs, Hadoop has a greater propensity for derived data, generated by both manual and automated data pipelines, to be stored back into the warehouse. Keeping track of this data, and with whom it’s shared, can be challenging.

Q4. When talking about analytics across big data, how can companies handle the operational challenge of implementing and managing a high-performing, big-data environment?

First, you need a technology that makes the most efficient use of your Hadoop hardware. Datameer’s Smart Execution technology, with intelligent hybrid in-memory processing, was designed to do just that. You then need to put workload management in place to prioritize workloads accordingly and make sure big workloads don’t get in the way of smaller ones; this is largely a matter of configuring Hadoop. Next, you need to eliminate duplicate workloads (like doing the same big join twice in two different use cases). Datameer’s data lineage and audit tools make this easy for administrators, who can then refactor things to do the heavy lifting just once. Finally, Datameer provides visual tools that profile jobs and help locate inefficiencies. Users often forget that applying filters before joining or aggregating makes the most sense, and the system will show you that after the first run of an analytics workload, as the sketch below illustrates.
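
Here is a small pandas sketch of that filter-before-join principle, on hypothetical data and column names; the same idea applies to joins in Hadoop jobs, just at far larger scale.

```python
# Filter-before-join: both pipelines produce the same result, but the
# second one only feeds the matching rows into the (expensive) join.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": range(1, 1001),
    "region": ["EMEA" if i % 4 == 0 else "AMER" for i in range(1, 1001)],
    "amount": [i * 1.5 for i in range(1, 1001)],
})
customers = pd.DataFrame({
    "customer_id": range(1, 1001),
    "segment": ["enterprise" if i % 10 == 0 else "smb" for i in range(1, 1001)],
})

# Wasteful: join everything (1,000 rows), then throw most of it away.
wasteful = orders.merge(customers, on="customer_id")
wasteful = wasteful[wasteful["region"] == "EMEA"]

# Better: filter first, so the join only processes the 250 EMEA rows.
efficient = orders[orders["region"] == "EMEA"].merge(customers, on="customer_id")

assert wasteful.reset_index(drop=True).equals(efficient.reset_index(drop=True))
```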

Q5. How does Datameer handle the challenge of providing an analytics solution that is usable by a broad range of executives and business analysts?  

We took the most common form factor on the planet, with over 800 million users: the spreadsheet. This allows users to be instantly comfortable working with Datameer and to apply pre-existing skills, unlike other Hadoop tools, which feel foreign and don’t show users their data at every step.

Q6. Is cloud adoption changing the big data analytics landscape? If yes, how?

Slowly but surely, yes. Data is more and more often generated and stored in the cloud, and it makes the most sense to do your big data analytics close to the data. The cloud also offers the elastic scale-up and scale-down capability and pay-as-you-go economics you just can’t get on premises. Finally, IT organizations are finding that deploying and operating Hadoop can be expensive from a human capital perspective, and small-to-medium IT shops are starting to opt for public cloud platform-as-a-service Hadoop offerings like Altiscale or Microsoft HDInsight. We expect this trend to continue.

Q7. What are, in your opinion, the three most successful big data use cases you have seen lately?

Broadly speaking, customer analytics continues to be No. 1. Companies that in any way touch consumers are always trying to market better, optimize their acquisition funnel, combat churn, and up-sell/cross-sell using insights from big data. But they’re doing it in unique and different ways, like a retirement services firm identifying “signatures” of at-risk customers across multiple channels and using decision tree analytics to construct retention campaigns. The Internet of Things is starting to become a factor, with customers like Vivint optimizing the home automation experience through analytics on streaming data from sensors. Finally, the tried-and-true approach of offloading existing ETL processes from a costly enterprise data warehouse is a typical “quick win” for big data.
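
As a rough sketch of the decision-tree retention idea, here is a scikit-learn example on synthetic data. The behavioral features and the churn rule are entirely hypothetical stand-ins; the firm mentioned above built its “signatures” from real multi-channel data.

```python
# A hedged illustration of decision-tree churn analysis on synthetic data.
# Feature names and the label rule are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.poisson(5, n),    # logins per month
    rng.poisson(1, n),    # support calls per month
    rng.normal(0, 1, n),  # normalized balance change
])
# Synthetic label: customers with few logins and falling balances tend to leave.
y = ((X[:, 0] < 3) & (X[:, 2] < 0)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# The printed rules are the "signature" a retention campaign could target.
print(export_text(tree, feature_names=["logins", "support_calls", "balance_delta"]))
```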

Q8. What are the main changes currently going on in Hadoop deployments?

I think Hadoop deployments are becoming more operational, meaning that insights generated get pushed directly back into the OLTP systems that customers and back-office operations depend on. This requires reliable SLAs for these workloads and more change control procedures. Second, we’re starting to see in-memory processing make Hadoop a viable approach for lower-latency workloads. Finally, deployments are simply getting broader, going from a single data science group to IT departments operating a shared service and charging back against consumption by multiple business units.

Q9. What are the main technical challenges in handling data integration and data curation?

You need a lot of connectors to reach all of those data sources, and the capability to handle unstructured and semi-structured data well. You need data profiling, cleansing, and masking capabilities built in, and visual tools to help users quickly assess the value and validity of raw data. Finally, the whole system needs to be schema-less and designed for the runtime environment (natively, Hadoop’s YARN) from the ground up, or it won’t be able to handle multi-stage processing (or, for that matter, get the most out of the Hadoop hardware).
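
For a sense of what basic data profiling looks like, here is a minimal pandas sketch that computes quick per-column statistics so a user can assess the validity of a raw feed. The input file and its columns are hypothetical, and real profiling tools go much further (type inference, pattern detection, outlier flags).

```python
# A minimal data-profiling sketch: per-column dtype, null counts,
# distinct counts, and a sample value. Input file is hypothetical.
import pandas as pd

df = pd.read_csv("raw_feed.csv")  # hypothetical raw extract

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "null_pct": (df.isna().mean() * 100).round(1),
    "distinct": df.nunique(),
    "sample": df.apply(lambda col: col.dropna().iloc[0] if col.notna().any() else None),
})
print(profile)
```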

Q10. What do you think of the Open Data Platform initiative?

I think it’s important that the Hadoop ecosystem work toward standards and agree on how to interact with Hadoop services. This will ensure that, as Hadoop continues to evolve, the ISV ecosystem around Hadoop (including Datameer) can maintain support for the latest versions and innovate without worrying as much about the shifting sands of APIs beneath it.
