On the Challenges and Opportunities of IoT. Interview with Steve Graves
http://www.odbms.org/blog/2016/07/on-the-challenges-and-opportunities-of-iot-interview-with-steve-graves/
Wed, 06 Jul 2016

“Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud.”–Steve Graves.

I have interviewed Steve Graves, co-founder and CEO of McObject. The main topic of the interview is the Internet of Things and how it relates to databases.

RVZ

Q1. What are in your opinion the main Challenges and Opportunities of the Internet of Things (IoT) seen from the perspective of a database vendor?

Steve Graves: Let’s start with the opportunities.

When we started McObject in 2001, we chose “eXtremeDB, the embedded database for intelligent, connected devices” as our tagline. eXtremeDB was designed from the get-go to live in the “things” comprising what the industry now calls the Internet of Things. The popularization of this term has created a lot of visibility and, more importantly, excitement and buzz for what was previously viewed as the relatively boring “embedded systems.” And that creates a lot of opportunities.

A lot of really smart, creative people are thinking of innovative ways to improve our health, our workplace, our environment, our infrastructure, and more. That means new opportunities for vendors of every component of the technology stack.
The challenges are manifold, and I can’t begin to address all of them. The media is largely fixated on security, which itself is multi-dimensional.
We can talk about protecting IoT-enabled devices (e.g. your car) from being hacked. We can talk about protecting the privacy of your data at rest. And we can talk about protecting the privacy of data in motion.
Every vendor needs to recognize the importance of security. But it isn't enough for a vendor, like McObject, to provide the features to secure the target system; the developer that assembles the stack along with their own proprietary technology to create an IoT solution needs to use available security features, and use them correctly.

After security, scaling IoT systems is the next big challenge. It’s easy enough to prototype something.
But careful planning is needed to leap from prototype to full-blown deployment. Obvious decisions have to be made about connectivity and necessary bandwidth, how many things per gateway, one tier of gateways or more, and how much compute capacity is needed in the cloud. Beyond that, there are less obvious decisions to be made that will affect scalability, like making sure the DBMS used on devices and/or gateways is able to handle the workload (e.g. that the gateway DBMS can scale from 10 input streams to 100 input streams); determining how to divide the analytics workload between gateways and the cloud; and ensuring that the gateway, its DBMS and its communication stack can stream data to the cloud while simultaneously processing its own input streams and analytics.
Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud. In fact, ‘things’ are usually very much like good ol’ embedded systems, and system engineers that know their way around real-time/embedded operating systems, JTAG debuggers, and so on, have always been at a premium.

Q2. Data management for the IoT: What are the main differences between data management in field-deployed devices and at aggregation points?

Steve Graves: Quite simply: scale. A field-deployed device (or a gateway to field-deployed devices that do not, themselves, have any data management need or capability) has to manage a modest amount of data. But an aggregation point (the cloud being the most obvious example) has to manage many times more data – possibly orders of magnitude more.
At the same time, I have to say that they might not be all that different. Some IoT systems are going to be closed, meaning the nature of the things making up the system is known, and these won’t require much scaling. For example, a building automation system for a small- to mid-size building would have perhaps 100s of sensors and 10s of gateways, and may (or may not) push data up to a central aggregation point. If there are just 10s of gateways, we can create a UI that connects to the database on each gateway where each database is one shard of a single logical database, and execute analytics against that logical database without any need of a central aggregation point. We can extend this hypothetical case to a campus of buildings, or to a landlord with many buildings in a metropolitan area, and then a central aggregation point makes sense.

But the database system would not necessarily be different, only the organization of the physical and logical databases.
The gateways of each building would stream to a database server in the cloud. In the case of 10 buildings, we could have 10 database servers in the cloud that represent 10 shards of that logical database in the cloud. This architecture allows for great scalability. The landlord acquires another building? Great, stand up another database server and the UI connects to 11 shards instead of 10. In this scenario, database servers are software, not hardware. For the numbers we’re talking about (10 or 11 buildings), it could easily be handled by a single hardware server of modest ability.
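
As a rough illustration of this shard-per-building pattern, the sketch below fans one query out over per-shard connections and merges the results. It is a hypothetical example, not eXtremeDB's API: Python's bundled sqlite3 stands in for the per-building database servers, and the file names and "readings" table are invented.

```python
import sqlite3

# One connection target per shard; adding an 11th building is just appending a path.
shard_paths = [f"building_{i}.db" for i in range(10)]

def query_logical_database(sql, params=()):
    """Run the same query against every shard and merge the row sets."""
    merged = []
    for path in shard_paths:
        with sqlite3.connect(path) as conn:
            merged.extend(conn.execute(sql, params).fetchall())
    return merged

# Example usage (assumes each shard holds a "readings" table):
# rows = query_logical_database(
#     "SELECT sensor_id, AVG(temp) FROM readings GROUP BY sensor_id")
```

The pattern is the same whether the shards are local files on gateways or database servers in the cloud; scaling out is a matter of adding entries to the shard list.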

At the other end of the scale (pun intended) are IoT systems that are wide open. By that, I mean the creators are not able to anticipate the universe of “things” that could be connected, or their quantity. In the first case, the database system should be able to ingest data that was heretofore unknown. This argues for a NoSQL database system, i.e. a database system that is schema-less. In this scenario, the database system on field-deployed devices is probably radically different from the database system in the cloud. Field-deployed devices are purpose-specific, so A) they don’t need and wouldn’t benefit from a NoSQL database system, and B) most NoSQL database systems are too resource-hungry to reside on embedded device nodes.

Q3. If we look at the characteristics of a database system for managing device-based data in the IoT, how do they differ from the characteristics of a database system (typically deployed on a server) for analyzing the “big data” generated by myriad devices?

Steve Graves: Again, let's recognize that field-deployed devices in the IoT are classic embedded systems. In practical terms, that means relatively modest hardware like an ARM, MIPS, PowerPC or Atom processor running at 100s of megahertz, or perhaps 1 GHz if we're lucky, and with only enough memory to perform its function. Further, it may require a real-time operating system, or at least an embedded operating system that is less resource-hungry than a full-on Linux distro. So, for a database system to run in this environment, it will need to have been designed to run in this environment. It isn't practical to try to shoehorn in a database system that was written on the assumption that CPU cycles and memory are abundant. It may also be the case that the device has little-to-no persistent storage, which mandates an in-memory database.

So a database system for a field-deployed device is going to:
1. have a small code size
2. use little stack
3. preferably, allocate no heap memory
4. have no, or minimal, external dependencies (e.g. not link in an extra 1 MB of code from the C run-time library)
5. have a built-in ability to replicate data (to a gateway or directly to the cloud)
   a. replication should be "open," meaning able to replicate to a different database system
6. have built-in security features
7. and, as nice-to-haves:
   a. built-in analytics to aggregate data prior to replicating it
   b. the ability to define the schema
   c. the ability to operate entirely in memory

A database system for the cloud might benefit from being schema-less, as described previously. It should certainly have pretty elastic scalability. Servers in the cloud are going to have ample resources and robust operating systems. So a database system for the cloud doesn’t need to have a small code size, use a small amount of stack memory, or worry about external dependencies such as the C run-time library. On the contrary, a database system for the cloud is expected to do much more (handle data at scale, execute analytics, etc.) and will, therefore, need ample resources. In fact, this database system should be able to take maximum advantage of the resources available, including being able to scale horizontally (across cores, CPUs, and servers).
In summary, the edge (device-based) DBMS needs to operate in a constrained environment. A cloud DBMS needs to be able to effectively and efficiently utilize the ample resources available to it.
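
To make the edge-versus-cloud split concrete, here is a toy sketch of the gateway-side behavior implied by the list above: keep recent samples entirely in memory and replicate only aggregates upstream. It is written in Python purely for readability; a real device-class DBMS would be native code with a small, largely static footprint.

```python
from collections import deque
from statistics import mean

class SensorWindow:
    def __init__(self, max_samples=60):
        # Bounded in-memory buffer: no persistent storage, no unbounded growth.
        self.samples = deque(maxlen=max_samples)

    def ingest(self, value):
        self.samples.append(value)

    def summary(self):
        # Aggregate locally; only this summary is replicated to the cloud tier.
        return {"count": len(self.samples),
                "min": min(self.samples),
                "max": max(self.samples),
                "avg": mean(self.samples)}

w = SensorWindow()
for reading in (21.0, 21.4, 22.1):
    w.ingest(reading)
print(w.summary())   # {'count': 3, 'min': 21.0, 'max': 22.1, 'avg': 21.5}
```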

Q4. Why is the ability to define a database schema important (versus a schema-less DBMS, aka NoSQL) for field-deployed devices?

Steve Graves: Field-deployed devices will normally perform a few specific functions (sometimes, just one function). For example, a building automation system manages HVAC, lighting, etc. A livestock management system manages feed, output, and so on. In such systems, the data requirements are well known. The hallmark NoSQL advantage of being able to store data without predefining its structure is unwarranted. The other purported hallmark of NoSQL is horizontal scalability, but this is not a need for field-deployed devices.
Walking away from the relational database model (and its implicit use of a database schema) has serious implications.
A great deal of scientific knowledge has been amassed around the relational database model over the last few decades, and without it developers are completely on their own with respect to enforcing sound data management practices.

In the NoSQL sphere, there is nothing comparable to the relational model (e.g. E.F. Codd’s work) and the mathematical foundation (relational calculus) underpinning it.
There should be overwhelming justification for a decision to not use relational.
In my experience, that justification is absent for data management of field-deployed devices.
A database system that “knows” the data design (via a schema) can more intelligently manage the data. For example, it can manage constraints, domain dependencies, events and much more. And some of the purported inflexibility imposed by a schema can be eliminated if the DBMS supports dynamic DDL (see more details on this in the answer to question Q6, below).

Q5. In your opinion, do IoT aggregation points resemble data lakes?

Steve Graves: The term data lake was originally conceived in the context of Hadoop and map-reduce functionality. In more recent times, the meaning of the term has morphed to become synonymous with big data, and that is how I use the term. Insofar as a gateway can also be an aggregation point, I would not say ‘aggregation points resemble data lakes’ because gateway aggregation points, in all likelihood, will not manage Big Data.

Q6. What are the main technical challenges for database systems used to accommodate new and unforeseen data, for example when a new type of device begins streaming data?

Steve Graves: The obvious challenges are
1. The ability to ingest new data that has a previously unknown structure
2. The ability to execute analytics on #1
3. The ability to integrate analytics on #1 with analytics on previously known data

#1 is handled well by NoSQL DBMSs. But, it might also be handled well by an RDBMS via “dynamic DDL” (dynamic data definition language), e.g. the ability to execute CREATE TABLE, ALTER TABLE, and/or CREATE INDEX statements against an existing database.
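
A hedged sketch of what dynamic DDL looks like in practice is shown below, using Python's bundled sqlite3 module simply because it is widely available; the table, columns, and device type are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def register_device_type(name, fields):
    """Create a table for a device type discovered at runtime, plus an index."""
    cols = ", ".join(f"{f} REAL" for f in fields)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {name} (ts INTEGER, {cols})")
    conn.execute(f"CREATE INDEX IF NOT EXISTS idx_{name}_ts ON {name} (ts)")

def add_field(name, field):
    """A previously unknown attribute appears: extend the schema in place."""
    conn.execute(f"ALTER TABLE {name} ADD COLUMN {field} REAL")

register_device_type("air_quality", ["pm25", "co2"])
add_field("air_quality", "humidity")          # new reading type, no redeploy
conn.execute("INSERT INTO air_quality VALUES (1710000000, 12.5, 410.0, 0.43)")
```
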
To efficiently execute analytics against any data, the structure of the data must eventually be understood.
RDBMSs handle this through the database dictionary (the binary equivalent of the data definition language).
But some NoSQL DBMSs handle this through other metadata. For example, the MarkLogic DBMS uses JSON metadata to understand the structure of documents in its document store.
NoSQL DBMSs with no metadata whatsoever put the entire burden on the developers. In other words, since the data is opaque to the DBMS, the application code must read and interpret the content.

Q7. Client/server DBMS architecture vs. in-process DBMSs: which one is more suitable for IoT?

Steve Graves: For edge DBMSs (on constrained devices), an in-process architecture will be more suitable. It requires fewer resources than a client/server architecture, and imposes less latency through the elimination of inter-process communication. For cloud DBMSs, a client/server architecture will be more suitable. In the cloud environment, resources are not scarce, and the advantage of being able to scale horizontally will outweigh the added latency associated with client/server.
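
As a small illustration of the in-process model, the snippet below uses Python's bundled sqlite3: the database engine runs inside the application process, so a query is an ordinary function call with no socket or inter-process round trip. A client/server DBMS would add at least one network hop per request. The example data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # the "server" is this very process
conn.execute("CREATE TABLE readings (sensor_id TEXT, value REAL)")
conn.execute("INSERT INTO readings VALUES ('t1', 22.4)")

# Same address space: rows come back without serialization across a
# connection boundary or a context switch to a separate server process.
for row in conn.execute("SELECT sensor_id, value FROM readings"):
    print(row)
```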

Qx Anything else you wish to add?

Steve Graves: We feel that eXtremeDB is uniquely positioned for the Internet of Things. Not only have devices and gateways been in eXtremeDB's wheelhouse for 15 years, with over 25 million real-world deployments, but the scalability, time series data management, and analytics built into the eXtremeDB server (big data) offering make it an attractive cloud database solution as well. Being able to leverage a single DBMS across devices, gateways and the cloud has obvious synergistic advantages.

———————
Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.

Resources

Big Data, Analytics, and the Internet of Things, by Mohak Shah, analytics leader and research scientist at Bosch Research, USA. ODBMS.org, April 6, 2015

Privacy considerations & responsibilities in the era of Big Data & Internet of Things, by Ramkumar Ravichandran, Director, Analytics, Visa Inc. ODBMS.org, January 8, 2015

Securing Your Largest USB-Connected Device: Your Car, by Shomit Ghose, General Partner, ONSET Ventures, ODBMS.org, March 31, 2016

eXtremeDB Financial Edition DBMS Sweeps Records in Big Data Benchmark, ODBMS.org, July 2, 2016

eXtremeDB in-memory database

User Experience Design for the Internet of Things

Related Posts

On the Internet of Things. Interview with Colin Mahony, ODBMS Industry Watch, March 14, 2016

A Grand Tour of Big Data. Interview with Alan Morrison, ODBMS Industry Watch, February 25, 2016

On the Industrial Internet of Things. Interview with Leon Guzenda, ODBMS Industry Watch, January 28, 2016

Follow us on Twitter: @odbmsorg

##

On the Internet of Things. Interview with Colin Mahony
http://www.odbms.org/blog/2016/03/on-the-internet-of-things-interview-with-colin-mahony/
Mon, 14 Mar 2016

“Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data.”– Colin Mahony

I have interviewed Colin Mahony, SVP & General Manager, HPE Big Data Platform. The topics of the interview are the challenges of the Internet of Things, the opportunities for data analytics, the positioning of HPE Vertica, and HPE's cloud strategy.

RVZ

Q1. Gartner says 6.4 billion connected “things” will be in use in 2016, up 30 percent from 2015.  How do you see the global Internet of Things (IoT) market developing in the next years?

Colin Mahony: As manufacturers connect more of their "things," they have an increased need for analytics to derive insight from massive volumes of sensor or machine data. I see these manufacturers, particularly manufacturers of commodity equipment, with a need to provide more value-added services based on their ability to provide higher levels of service and overall customer satisfaction. Data analytics platforms are key to making that happen. Also, we could see entirely new analytical applications emerge, driven by what consumers want to know about their devices, combined with, say, their exercise regimens, health vitals, social activities, and even driving behavior, for full personal insight.
Ultimately, the Internet of Things will drive a need for the Analyzer of Things, and that is our mission.

Q2. What challenges and opportunities does the Internet of Things (IoT) bring?

Colin Mahony: Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data. The reason? Traditional data warehouses were designed well before the Internet of Things, or, at least before OT (operational technology) like medical devices, industrial equipment, cars, and more were connected to the Internet. So, having an analytical platform to provide the scale and performance required to handle these volumes is important, but customers are taking more of a two- or three-tier approach that involves some sort of analytical processing at the edge before data is sent to an analytical data store. Apache Kafka is also becoming an important tier in this architecture, serving as a message bus, to collect and push that data from the edge in streams to the appropriate database, CRM system, or analytical platform for, as an example, correlation of fault data over months or even years to predict and prevent part failure and optimize inventory levels.
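
A minimal sketch of that edge-to-analytics flow is shown below. It assumes the third-party kafka-python client and a broker reachable at localhost:9092; the topic name and message shape are invented for the example, and the downstream consumers (analytical store, CRM hook, fault-correlation job) are implied rather than shown.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

def publish_reading(device_id, metric, value):
    # Push one edge reading onto the message bus; downstream systems subscribe
    # to the topic independently and at their own pace.
    producer.send("edge-telemetry", {"device": device_id,
                                     "metric": metric,
                                     "value": value})

publish_reading("turbine-42", "vibration_mm_s", 3.7)
producer.flush()   # block until the broker has acknowledged the batch
```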

Q3. Big Data: In your opinion, what are the current main demands/needs in the market?

Colin Mahony: All organizations want – and need – to become data-driven organizations. I mean, who wants to make such critical decisions based on half answers and anecdotal data? That said, traditional companies with data stores and systems going back 30-40 years don't have the same level playing field as the next market disruptor that just received their series B funding and only knows that analytics is the lifeblood of their business and all their critical decisions.
The good news is that whether you are a 100-year old insurance company or the next Uber or Facebook, you can become a data-driven organization by taking an open platform approach that uses the best tool for the job and can incorporate emerging technologies like Kafka and Spark without having to bolt on or buy all of that technology from a single vendor and get locked in.  Understanding the difference between an open platform with a rich ecosystem and open source software as one very important part of that ecosystem has been a differentiator for our customers.

Beyond technology, we have customers that establish analytical centers of excellence that actually work with the data consumers – often business analysts – who run ad-hoc queries using their preferred data visualization tool to get the insight they need for their business unit or department. If the data analysts struggle, then this center of excellence, which happens to report up through IT, collaborates with them to understand and help them get to the analytical insight – rather than simply halting the queries with no guidance on how to improve.

Q4. How do you embed analytics and why is it useful? 

Colin Mahony: OEM software vendors, particularly, see the value of embedding analytics in their commercial software products or software as a service (SaaS) offerings.  They profit by creating analytic data management features or entirely new applications that put customers on a faster path to better, data-driven decision making. Offering such analytics capabilities enables them to not only keep a larger share of their customer’s budget, but at the same time greatly improve customer satisfaction. To offer such capabilities, many embedded software providers are attempting unorthodox fixes with row-oriented OLTP databases, document stores, and Hadoop variations that were never designed for heavy analytic workloads at the volume, velocity, and variety of today’s enterprise. Alternatively, some companies are attempting to build their own big data management systems. But such custom database solutions can take thousands of hours of research and development, require specialized support and training, and may not be as adaptable to continuous enhancement as a pure-play analytics platform. Both approaches are costly and often outside the core competency of businesses that are looking to bring solutions to market quickly.

Because it’s specifically designed for analytic workloads, HPE Vertica is quite different from other commercial alternatives. Vertica differs from OLTP DBMS and proprietary appliances (which typically embed row-store DBMSs) by grouping data together on disk by column rather than by row (that is, so that the next piece of data read off disk is the next attribute in a column, not the next attribute in a row). This enables Vertica to read only the columns referenced by the query, instead of scanning the whole table as row-oriented databases must do. This speeds up query processing dramatically by reducing disk I/O.
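
Below is a back-of-the-envelope illustration of that column-versus-row trade-off, in plain Python rather than anything resembling Vertica's actual storage engine: a query that references one attribute reads one contiguous column instead of every field of every record.

```python
rows = [  # row store: each record laid out attribute-by-attribute
    {"user": "a", "region": "EU", "spend": 10.0},
    {"user": "b", "region": "US", "spend": 25.0},
    {"user": "c", "region": "EU", "spend": 5.0},
]

columns = {  # column store: the same data grouped by attribute
    "user":   ["a", "b", "c"],
    "region": ["EU", "US", "EU"],
    "spend":  [10.0, 25.0, 5.0],
}

# "SELECT SUM(spend)": the row layout touches every field of every record ...
total_row_store = sum(r["spend"] for r in rows)

# ... while the column layout reads only the one column the query references.
total_col_store = sum(columns["spend"])

assert total_row_store == total_col_store == 40.0
```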

You'll find Vertica as the core analytical engine behind some popular products, including Lancope, Empirix, Good Data, and others, as well as many HPE offerings like HPE Operations Analytics, HPE Application Defender, and HPE App Pulse Mobile.

Q5. How do you make a decision when it is more appropriate to “consume and deploy” Big Data on premise, in the cloud, on demand and on Hadoop?

Colin Mahony: The best part is that you don’t need to choose with HPE. Unlike most emerging data warehouses as a service where your data is trapped in their databases when your priorities or IT policies change, HPE offers the most complete range of deployment and consumption models. If you want to spin up your analytical initiative on the cloud for a proof-of-concept or during the holiday shopping season for e-retailers, you can do that easily with HPE Vertica OnDemand.
If your organization finds that due to security or confidentiality or privacy concerns you need to bring your analytical initiative back in house, then you can use HPE Vertica Enterprise on-premises without losing any customizations or disruption to your business. Have petabyte volumes of largely unstructured data where the value is unknown? Use HPE Vertica for SQL on Hadoop, deployed natively on your Hadoop cluster, regardless of the distribution you have chosen. Each consumption model, available in the cloud, on-premise, on-demand, or using reference architectures for HPE servers, is available to you with that same trusted underlying core.

Q6. What is the new class of infrastructures called "composable"? Is it relevant for Big Data?

Colin Mahony: HPE believes that a new architecture is needed for Big Data – one that is designed to power innovation and value creation for the new breed of applications while running traditional workloads more efficiently.
We call this new architectural approach Composable Infrastructure. HPE has a well-established track record of infrastructure innovation and success. HPE Converged Infrastructure, software-defined management, and hyper-converged systems have consistently proven to reduce costs and increase operational efficiency by eliminating silos and freeing available compute, storage, and networking resources. Building on our converged infrastructure knowledge and experience, we have designed a new architecture that can meet the growing demands for a faster, more open, and continuous infrastructure.

Q7. What is HPE's cloud strategy?

Colin Mahony: Hybrid cloud adoption is continuing to grow at a rapid rate and a majority of our customers recognize that they simply can’t achieve the full measure of their business goals by consuming only one kind of cloud.
HPE Helion not only offers private cloud deployments and managed private cloud services, but we have created the HPE Helion Network, a global ecosystem of service providers, ISVs, and VARs dedicated to delivering open standards-based hybrid cloud services to enterprise customers. Through our ecosystem, our customers gain access to an expanded set of cloud services and improve their abilities to meet country-specific data regulations.

In addition to the private cloud offerings, we have a strategic and close alliance with Microsoft Azure, which enables many of our offerings, including Haven OnDemand, in the public cloud. We also work closely with Amazon because our strategy is not to limit our customers, but to ensure that they have the choices they need and the services and support they can depend upon.

Q8. What are the advantages of an offering like Vertica in this space?

Colin Mahony: More and more companies are exploring the possibility of moving their data analytics operations to the cloud. We offer HPE Vertica OnDemand, our data warehouse as a service, for organizations that need high-performance enterprise class data analytics for all of their data to make better business decisions now. Built by design to drastically improve query performance over traditional relational database systems, HPE Vertica OnDemand is engineered from the same technology that powers the HPE Vertica Analytics Platform. For organizations that want to select Amazon hardware and still maintain control over the installation, configuration, and overall maintenance of Vertica for ultimate performance and control, we offer Vertica AMI (Amazon Machine Image). The Vertica AMI is a bring-your-own-license model that is ideal for organizations that want the same experience as on-premise installations, only without procuring and setting up hardware. Regardless of which deployment model you choose, we have you covered for "on demand" or "enterprise cloud" options.

Q9. What is HPE Vertica Community Edition?

Colin Mahony: We have had tens of thousands of downloads of the HPE Vertica Community Edition, a freemium edition of HPE Vertica with all of the core features and functionality that you experience with our core enterprise offering. It's completely free for up to 1 TB of data storage across three nodes. Companies of all sizes use the Community Edition to download, install, set up, and configure Vertica very quickly on x86 hardware, or use our Amazon Machine Image (AMI) for a bring-your-own-license approach to the cloud.

Q10. Can you tell us how Kiva.org, a non-profit organization, uses on-demand cloud analytics to leverage the internet and a worldwide network of microfinance institutions to help fight poverty? 

Colin Mahony: HPE is a major supporter of Kiva.org, a non-profit organization with a mission to connect people through lending to alleviate poverty. Kiva.org uses the internet and a worldwide network of microfinance institutions to enable individuals to lend as little as $25 to help create opportunity around the world. When the opportunity arose to help support Kiva.org with an analytical platform to further the cause, we jumped at it. Kiva.org relies on Vertica OnDemand to reduce capital costs, leverage the SaaS delivery model to adapt more quickly to changing business requirements, and work with over a million lenders, hundreds of field partners and volunteers, across the world. To see a recorded Webinar with HPE and Kiva.org, see here.

Qx Anything else you wish to add?

Colin Mahony: We appreciate the opportunity to share the features and benefits of HPE Vertica as well as the bright market outlook for data-driven organizations. However, I always recommend that any organization struggling with how to get started with its analytics initiative speak and meet with peers to learn best practices and avoid potential pitfalls. The best way to do that, in my opinion, is to visit with the more than 1,000 Big Data experts in Boston from August 29 – September 1st at the HPE Big Data Conference. Click here to learn more and join us for 40+ technical deep-dive sessions.

————-

Colin Mahony, SVP & General Manager, HPE Big Data Platform

Colin Mahony leads the Hewlett Packard Enterprise Big Data Platform business group, which is responsible for the industry leading Vertica Advanced Analytics portfolio, the IDOL Enterprise software that provides context and analysis of unstructured data, and Haven OnDemand, a platform for developers to leverage APIs and on demand services for their applications.
In 2011, Colin joined Hewlett Packard as part of the highly successful acquisition of Vertica, and took on the responsibility of VP and General Manager for HP Vertica, where he guided the business to remarkable annual growth and recognized industry leadership. Colin brings a unique combination of technical knowledge, market intelligence, customer relationships, and strategic partnerships to one of the fastest growing and most exciting segments of HP Software.

Prior to Vertica, Colin was a Vice President at Bessemer Venture Partners focused on investments primarily in enterprise software, telecommunications, and digital media. He established a great network and reputation for assisting in the creation and ongoing operations of companies through his knowledge of technology, markets and general management in both small startups and larger companies. Prior to Bessemer, Colin worked at Lazard Technology Partners in a similar investor capacity.

Prior to his venture capital experience, Colin was a Senior Analyst at the Yankee Group serving as an industry analyst and consultant covering databases, BI, middleware, application servers and ERP systems. Colin helped build the ERP and Internet Computing Strategies practice at Yankee in the late nineties.

Colin earned an M.B.A. from Harvard Business School and a bachelor's degree in Economics with a minor in Computer Science from Georgetown University. He is an active volunteer with Big Brothers Big Sisters of Massachusetts Bay and the Joey Fund for Cystic Fibrosis.

Resources

What's in store for Big Data analytics in 2016, Steve Sarsfield, Hewlett Packard Enterprise, ODBMS.org, February 3, 2016

What's New in Vertica 7.2?: Apache Kafka Integration!, HPE, last edited February 2, 2016

Gartner Says 6.4 Billion Connected "Things" Will Be in Use in 2016, Up 30 Percent From 2015, Press release, November 10, 2015

The Benefits of HP Vertica for SQL on Hadoop, HPE, July 13, 2015

Uplevel Big Data Analytics with Graph in Vertica – Part 5: Putting graph to work for your business, Walter Maguire, Chief Field Technologist, HP Big Data Group, ODBMS.org, November 2, 2015

HP Distributed R, ODBMS.org, February 19, 2015

Understanding ROS and WOS: A Hybrid Data Storage Model, HPE, October 7, 2015

Related Posts

On Big Data Analytics. Interview with Shilpa Lawande, ODBMS Industry Watch, December 10, 2015

On HP Distributed R. Interview with Walter Maguire and Indrajit Roy, ODBMS Industry Watch, April 9, 2015

Follow us on Twitter: @odbmsorg

##

A Grand Tour of Big Data. Interview with Alan Morrison
http://www.odbms.org/blog/2016/02/a-grand-tour-of-big-data-interview-with-alan-morrison/
Thu, 25 Feb 2016

“Leading enterprises have a firm grasp of the technology edge that’s relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.”–Alan Morrison.

I have interviewed Alan Morrison, senior research fellow at PwC, Center for Technology and Innovation.
The main topic of the interview is how the Big Data market is evolving.

RVZ

Q1. How do you see the Big Data market evolving? 

Alan Morrison: We should note first of all how true Big Data and analytics methods emerged and what has been disruptive. Over the course of a decade, web companies have donated IP and millions of lines of code that serve as the foundation for what's being built on top. In the process, they've built an open source culture that is currently driving most big data-related innovation. As you mentioned to me last year, Roberto, a lot of database innovation was the result of people outside the world of databases changing what they thought needed to be fixed, people who really weren't versed in the database technologies to begin with.

Enterprises and the database and analytics systems vendors who serve them have to constantly adjust to the innovation that’s being pushed into the open source big data analytics pipeline. Open source machine learning is becoming the icing on top of that layer cake.

Q2. In your opinion what are the challenges of using Big Data technologies in the enterprise?

Alan Morrison: Traditional enterprise developers were thrown for a loop back in the late 2000s when it came to open source software, and they're still adjusting. The severity of the problem differs depending on the age of the enterprise. In our 2012 issue of the Forecast on DevOps, we made clear distinctions between three age classes of companies: legacy mainstream enterprises, pre-cloud enterprises and cloud natives. Legacy enterprises could have systems that are 50 years old or more still in place and have simply added to those. Pre-cloud enterprises are fighting with legacy that's up to 20 years old. Cloud natives don't have to fight legacy and can start from scratch with current tech.

DevOps (dev + ops) is an evolution of agile development that focuses on closer collaboration between developers and operations personnel. It's a successor to agile development, a methodology that enables multiple daily updates to operational codebases and feedback-response loop tuning by making small code changes and seeing how those changes affect user experience and behaviour. The linked article makes a distinction between legacy, pre-cloud and cloud native enterprises in terms of their inherent level of agility:

[Figure 1: Levels of agility in legacy mainstream, pre-cloud, and cloud-native enterprises]

Most enterprises are in the legacy mainstream group, and the technology adoption challenges they face are the same regardless of the technology. To build feedback-response loops for a data-driven enterprise in a legacy environment is more complicated in older enterprises. But you can create guerilla teams to kickstart the innovation process.

Q3. Is the Hadoop ecosystem now ready for enterprise deployment at large scale? 

Alan Morrison: Hadoop is ten years old at this point, and Yahoo, a very large mature enterprise, has been running Hadoop on 10,000 nodes for years now. Back in 2010, we profiled a legacy mainstream media company that was doing logfile analysis from all of its numerous web properties on a Hadoop cluster quite effectively. Hadoop is to the point where people in their dens and garages are putting it on Raspberry Pi systems. Lots of companies are storing data in or staging it from HDFS. HDFS is a given. MapReduce, on the other hand, has given way to Spark.

HDFS preserves files in their original format immutably, and that’s important. That innovation was crucial to data-driven application development a decade ago. But Hadoop isn’t the end state for distributed storage, and NoSQL databases aren’t either. It’s best to keep in mind that alternatives to Hadoop and its ecosystem are emerging.

I find it fascinating what folks like LinkedIn and Metamarkets are doing, data-architecture-wise, with the Kappa architecture – essentially a stream processing architecture that also works for batch analytics, a system where operational and analytical data are one and the same. That's appropriate for fully online, all-digital businesses. You can use HDFS, S3, GlusterFS or some other file system along with a database such as Druid. On the transactional side of things, the nascent IPFS (the Interplanetary File System) anticipates both peer-to-peer and the use of blockchains in environments that are more and more distributed. Here's a diagram we published last year that describes this evolution to date:
[Figure 2: The evolution of distributed data architectures to date, from the PwC Technology Forecast 2015]

People shouldn't be focused on Hadoop, but on what comes next, now that Hadoop has cleared a path for it.

Q4. What are in your opinion the most innovative Big Data technologies?

Alan Morrison: The rise of immutable data stores (HDFS, Datomic, Couchbase and other comparable databases, as well as blockchains) was significant because it was an acknowledgement that data history and permanence matter, that the technology is mature enough, and that the cost is low enough to eliminate the need to overwrite. These data stores also established that eliminating overwrites eliminates a cause of contention. We're moving toward native cloud and eventually the P2P fog (localized, more truly distributed computing) that will extend the footprint of the cloud for the Internet of Things.

Unsupervised machine learning has made significant strides in the past year or two, and it has become possible to extract facts from unstructured data, building on the success of entity and relationship extraction. What this advance implies is the ability to put humans in feedback loops with machines, where they let machines discover the data models and facts and then tune or verify those data models and facts.

In other words, large enterprises now have the capability to build their own industry- and organization-specific knowledge graphs and begin to develop cognitive or intelligent apps on top of those knowledge graphs, along the lines of what Cirrus Shakeri of Inventurist envisions.

[Figure 3: From Cirrus Shakeri, "From Big Data to Intelligent Applications," January 2015]

At the core of computable semantic graphs (Shakeri’s term for knowledge graphs or computable knowledge bases) is logically consistent semantic metadata. A machine-assisted process can help with entity and relationship extraction and then also ontology generation.

Computability = machine readability. Semantic metadata – the kind of metadata cognitive computing apps use – can be generated with the help of a well-designed and updated ontology. More and more, these ontologies are uncovered in text rather than hand-built, but again, there's no substitute for humans in the loop. Think of the process of cognitive app development as a continual feedback-response loop. The use of agents can facilitate the construction of these feedback loops.
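
As a loose sketch of that feedback loop, the fragment below has a machine propose candidate facts as subject-predicate-object triples with confidence scores, and only human-verified triples enter the knowledge graph. The data and the review step are purely illustrative, with no particular extraction library implied.

```python
proposed = [  # machine-extracted candidate facts with confidence scores
    {"triple": ("Pump-7", "is_part_of", "Line-3"), "confidence": 0.92},
    {"triple": ("Line-3", "located_in", "Plant-A"), "confidence": 0.55},
]

knowledge_graph = set()   # verified facts only

def review(fact, approved_by_human):
    # Even high-confidence facts pass through review; the loop is what keeps
    # the ontology and the graph consistent over time.
    if approved_by_human:
        knowledge_graph.add(fact["triple"])
    return approved_by_human

review(proposed[0], approved_by_human=True)    # analyst confirms
review(proposed[1], approved_by_human=False)   # analyst rejects; model is retuned later
print(knowledge_graph)  # {('Pump-7', 'is_part_of', 'Line-3')}
```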

Q5. In a recent note Carl Olofson, Research Vice President, Data Management Software Research, IDC, predicted the RIP of “Big Data” as a concept. What is your view on this?

Alan Morrison: I agree the term is nebulous and can be misleading, and we’ve had our fill of it. But that doesn’t mean it won’t continue to be used. Here’s how we defined it back in 2009:

Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used. (See https://www.pwc.com/us/en/technology-forecast/assets/pwc-tech-forecast-issue3-2010.pdf, pg. 6.)

For that issue of the Forecast, we focused on how Hadoop was being piloted in enterprises and the ecosystem that was developing around it. Hadoop was the primary disruptive technology, as well as NoSQL databases. It helps to consider the data challenge of the 2000s and how relational databases and enterprise data warehousing techniques were falling short at that point.  Hadoop has reduced the cost of analyzing data by an order of magnitude and allows processing of very large unstructured datasets. NoSQL has made it possible to move away from rigid data models and standard ETL.

“Big Data” can continue to be shorthand for petabytes of unruly, less structured data. But why not talk about the system instead of just the data? I like the term that George Gilbert of Wikibon latched on to last year. I don’t know if he originated it, but he refers to the System of Intelligence. That term gets us beyond the legacy, pre-web “business intelligence” term, more into actionable knowledge outputs that go beyond traditional reporting and into the realm of big data, machine learning and more distributed systems. The Hadoop ecosystem, other distributed file systems, NoSQL databases and the new analytics capabilities that rely on them are really at the heart of a System of Intelligence.

Q6. How many enterprise IT systems do you think we will need to interoperate in the future? 

Alan Morrison: I like Geoffrey Moore‘s observations about a System of Engagement that emerged after the System of Record, and just last year George Gilbert was adding to that taxonomy with a System of Intelligence. But you could add further to that with a System of Collection that we still need to build. Just to be consistent, the System of Collection articulates how the Internet of Things at scale would function on the input side. The System of Engagement would allow distribution of the outputs. For the outputs of the System of Collection to be useful, that system will need to interoperate in various ways with the other systems.

To summarize, there will actually be four enterprise IT systems that will need to interoperate, ultimately. Three of these exist, and one still needs to be created.

The fuller picture will only emerge when this interoperation becomes possible.

Q7. What are the requirements, heritage and legacy of such systems?

Alan Morrison: The System of Record (RDBMSes) still relies on databases and tech with their roots in the pre-web era. I'm not saying these systems haven't been substantially evolved and refined, but they do still reflect a centralized, pre-web mentality. Bitcoin and Blockchain make it clear that the future of Systems of Record won't always be centralized. In fact, microtransaction flows in the Internet of Things at scale will depend on the decentralized approaches, algorithmic transaction validation, and immutable audit trail creation that blockchain inspires.

The Web is only an interim step in the distributed system evolution. P2P systems will eventually complement the web, but they'll take a long time to kick in fully – well into the next decade. There's always the S-curve of adoption that starts flat for years. P2P has ten years of an installed base of cloud tech, twenty years of web tech and fifty-plus years of centralized computing to contend with. The bitcoin blockchain seems to have kicked P2P into gear finally, but progress will be slow through 2020.

The System of Engagement (requiring Web DBs) primarily relies on Web technology (MySQL and NoSQL) in conjunction with traditional CRM and other customer-related structured databases.

The System of Intelligence (requiring Web file systems and less structured DBs) primarily relies on NoSQL, Hadoop, the Hadoop ecosystem and its successors, but is built around a core DW/DM RDBMS analytics environment with ETLed structured data from the System of Record and System of Engagement. The System of Intelligence will have to scale and evolve to accommodate input from the System of Collection.

The System of Collection (requiring distributed file systems and DBs) will rely on distributed file system successors to Hadoop and HTTP such as IPFS, and the more distributed successors to MySQL + NoSQL. Over the very long term, a peer-to-peer architecture will emerge that will become necessary to extend the footprint of the Internet of Things and allow it to scale.

Q8. Do you already have the piece parts to begin to build out a 2020+ intersystem vision now?

Alan Morrison: Contextual, ubiquitous computing is the vision of the 2020s, but to get to that, we need an intersystem approach. Without interoperation of the four systems I’ve alluded to, enterprises won’t be able to deliver the context required for competitive advantage. Without sufficient entity and relationship disambiguation via machine learning in machine/human feedback loops, enterprises won’t be able to deliver the relevance for competitive advantage.

We do have the piece parts to begin to build out an intersystem vision now. For example, interoperation is a primary stumbling block that can be overcome now. Middleware has been overly complex and inadequate to the current-day task, but middleware platforms such as EnterpriseWeb are emerging that can reach out as an integration fabric for all systems, up and down the stack. Here’s how the integration fabric becomes an essential enabler for the intersystem approach:

[Figure 4: The integration fabric as an enabler of the intersystem approach. PwC, 2015]

A lot of what EnterpriseWeb (full disclosure: a JBR partner of PwC) does hinges on the creation and use of agents and semantic metadata that enable the data/logic virtualization. That’s what makes the desiloing possible. One of the things about the EnterpriseWeb platform is that it’s a full stack virtual integration and application platform, using methods that have data layer granularity, but process layer impact. Enterprise architects can tune their models and update operational processes at the same time. The result: every change is model-driven and near real-time. Stacks can all be simplified down to uniform, virtualized composable entities using enabling technologies that work at the data layer. Here’s how they work:

[Figure 5: How the enabling technologies work at the data layer. PwC, 2015]

So basically you can do process refinement across these systems, and intersystem analytics views thus also become possible.

Qx Anything else you wish to add?

Alan Morrison: We always quote science fiction writer William Gibson, who said,

“The future is already here — it’s just not very evenly distributed.”

Enterprises would do best to remind themselves what's possible now and start working with it. You've got to grab onto that technology edge and let it pull you forward. If you don't understand what's possible, what's most relevant to your future business success, and how to use it, you'll never make progress and you'll always be reacting to crises. Leading enterprises have a firm grasp of the technology edge that's relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.

We do a ton of research to get to the big picture and find the real edge, where tech could actually have a major business impact. And we try to think about what the business impact will be, rather than just thinking about the tech. Most folks who are down in the trenches are dismissive of the big picture, but the fact is they aren’t seeing enough of the horizon to make an informed judgement. They are trying to use tools they’re familiar with to address problems the tools weren’t designed for. Alongside them should be some informed contrarians and innovators to provide balance and get to a happy medium.

That’s how you counter groupthink in an enterprise. Executives need to clear a path for innovation and foster a healthy, forward-looking, positive and tolerant mentality. If the workforce is cynical, that’s an indication that they lack a sense of purpose or are facing systemic or organizational problems they can’t overcome on their own.

—————–
Alan Morrison (@AlanMorrison) is a senior research fellow at PwC, a longtime technology trends analyst and an issue editor of the firm's Technology Forecast.

Resources

Data-driven payments. How financial institutions can win in a networked economy, by Mark Flamme, Partner; Kevin Grieve, Partner; Mike Horvath, Principal, Strategy&. ODBMS.org, February 4, 2016

The rise of immutable data stores, by Alan Morrison, Senior Manager, PwC Center for Technology and Innovation (CTI), ODBMS.org, October 9, 2015

The enterprise data lake: Better integration and deeper analytics, by Brian Stein and Alan Morrison, PwC, ODBMS.org, August 20, 2014

Related Posts

On the Industrial Internet of Things. Interview with Leon Guzenda, ODBMS Industry Watch, January 28, 2016

On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch, January 8, 2016

On Big Data Analytics. Interview with Shilpa Lawande, ODBMS Industry Watch, December 10, 2015

On Dark Data. Interview with Gideon Goldin, ODBMS Industry Watch, November 16, 2015

Follow us on Twitter: @odbmsorg

##

On Database Resilience. Interview with Seth Proctor
http://www.odbms.org/blog/2015/03/interview-seth-proctor/
Tue, 17 Mar 2015

"In normal English usage the word resilience is taken to mean the power to resume original shape after compression; in the context of data base management the term data base resilience is defined as the ability to return to a previous state after the occurrence of some event or action which may have changed that state."
from P. A. Dearnley, School of Computing Studies, University of East Anglia, Norwich NR4 7TJ, 1975

On the topic of database resilience, I have interviewed Seth Proctor, Chief Technology Officer at NuoDB.

RVZ

Q1. When is a database truly resilient?

Seth Proctor: That is a great question, and the quotation above is a good place to start. In general, resiliency is about flexibility. It’s kind of the view that you should bend but not break. Individual failures (server crashes, disk wear, tripping over power cables) are inevitable but don’t have to result in systemic outages.
In some cases that means reacting to failure in a way that’s non-disruptive to the overall system.
The redundant networks in modern airplanes are a great example of this model. Other systems take a deeper view, watching global state to proactively re-route activity or replace components that may be failing. This is the model that keeps modern telecom networks running reliably. There are many views applied in the database world, but to me a resilient database is one that can react automatically to likely or active failures so that applications continue operating with full access to their data even as failures occur.

Q2. Is database resilience the same as disaster recovery?

Seth Proctor: I don’t believe it is. In traditional systems there is a primary site where the database is “active” and updates are replicated from there to other sites. In the case of failure to the primary site, one of the replicas can take over. Maintaining that replica (or replicas) is usually the key part of Disaster Recovery.
Sometimes that replica is missing the latest changes, and usually the act of “failing over” to a replica involves some window where the database is unavailable. This leads to operational terms like “hot stand-by” where failing over is faster but still delayed, complicated and failure-prone.

True resiliency, in my opinion, comes from systems that are designed to always be available even as some components fail. Reacting to failure efficiently is a key requirement, as is survival in case of complete site loss, so replicating data to multiple locations is critical to resiliency. At a minimum, however, a resilient data management solution cannot lose data (sort of “primum non nocere” for servers) and must be able to provide access to all data even as servers fail. Typical Disaster Recovery solutions on their own are not sufficient. A resilient solution should also be able to continue operations in the face of expected failures: hardware and software upgrades, network updates and service migration.
This is especially true as we push out to hybrid cloud deployments.

Q3. What are the database resilience requirements and challenges, especially in this era of Big Data?

Seth Proctor: There is no one set of requirements since each application has different goals with different resiliency needs. Big Data is often more about speeds and volume while in the operational space correctness, latency and availability are key. For instance, if you’re handling high-value bank transactions you have different needs than something doing weekly trend-analysis on Internet memes. The great thing about “the cloud” is the democratization of features and the new systems that have evolved around scale-out architectures. Things like transactional consistency were originally designed to make failures simpler and systems more resilient; as consistent data solutions scale out in cloud models it’s simpler to make any application resilient without sacrificing performance or increasing complexity.

That said, I look for a couple of key criteria when designing with resiliency in mind. The first is a distributed architecture, the foundation for any system to survive individual failure but remain globally available.
Ideally this provides a model where an application can continue operating even as arbitrary components fail. Second is the need for simple provisioning & monitoring. Without this, it’s hard to react to failures in an automatic or efficient fashion, and it’s almost impossible to orchestrate normal upgrade processes without down-time. Finally, a database needs to have a clear model for how the same data is kept in multiple locations and what the failure modes are that could result in any loss. These requirements also highlight a key challenge: what I’ve just described are what we expect from cloud infrastructure, but are pushing the limits of what most shared-nothing, sharded or vertically-scaled data architectures offer.

Q4. What is the real risk if the database goes offline?

Seth Proctor: Obviously one risk is the ripple effect it has to other services or applications.
When a database fails it can take with it core services, applications or even customers. That can mean lost revenue or opportunity and it almost certainly means disruption across an organization. Depending on how a database goes offline, the risk may also extend to data loss, corruption, or both. Most databases have to trade-off certain elements of latency against guaranteed durability, and it’s on failure that you pay for that choice. Sometimes you can’t even sort out what information was lost. Perhaps most dangerous, modern deployments typically create the illusion of a data management service by using multiple databases for DR, scale-out etc. When a single database goes offline you’re left with a global service in an unknown state with gaps in its capabilities. Orchestrating recovery is often expensive, time-consuming and disruptive to applications.

Q5. How are customers solving the continuous availability problem today?

Seth Proctor: Broadly, database availability is tackled in one of two fashions. The first is by running with many redundant, individual, replicated servers so that any given server can fail or be taken offline for maintenance as needed. Putting aside the complexity of operating so many independent services and the high infrastructure costs, there is no single view of the system. Data is constantly moving between services that weren’t designed with this kind of coordination in mind so you have to pay close attention to latencies, backup strategies and visibility rules for your applications. The other approach is to use a database that has forgone consistency, making a lot of these pieces appear simpler but placing the burden that might be handled by the database on the application instead. In this model each application needs to be written to understand the specifics of the availability model and in exchange has a service designed with redundancy.

Q6. Now that we are in the Cloud era, is there a better way?

Seth Proctor: For many pieces of the stack cloud architectures result in much easier availability models. For the database specifically, however, there are still some challenges. That said, I think there are a few great things we get from the cloud design mentality that are rapidly improving database availability models. The first is an assumption about on-demand resources and simplicity of spinning up servers or storage as needed. That makes reacting to failure so much easier, and much more cost-effective, as long as the database can take advantage of it. Next is the move towards commodity infrastructure. The economics certainly make it easier to run redundantly, but commodity components are likely to fail more frequently. This is pushing systems design, making failure tests critical and generally putting more people into the defensive mind-set that’s needed to build for availability. Finally, of course, cloud architectures have forced all of us to step back and re-think how we build core services, and that’s leading to new tools designed from the start with this point of view. Obviously that’s one of the most basic elements that drives us at NuoDB towards building a new kind of database architecture.

Q7. Can you share methodologies for avoiding single points of failure?

Seth Proctor: For sure! The first thing I’d say is to focus on layering & abstraction.
Failures will happen all the time, at every level, and in ways you never expect. Assume that you won't test all of them ahead of time and focus on making each little piece of your system clear about how it can fail and what it needs from its peers to be successful. Maybe it's obvious, but to avoid single points of failure you need components that are independent and able to stand in for each other. Often that means replicating data at lower levels and using DNS or load balancers at a higher level to have one name or endpoint map to those independent components. Oh, also, decouple your application logic as much as possible from your operational model. I know that goes against some trends, but really, if your application has deep knowledge of how some service is deployed and running, it makes it really hard to roll with failures or changes to that service.
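To make the "one name, many independent components" point concrete, here is a minimal sketch in Python, not tied to NuoDB or any particular load balancer; the service name, endpoints, and `fetch` helper are all hypothetical. The application only knows the logical name, so individual replicas can fail or be replaced without touching application logic.

```python
import random
import urllib.request

# Hypothetical pool of interchangeable endpoints behind one logical service name.
SERVICE_ENDPOINTS = {
    "orders-db": [
        "http://db-a.internal:8080",
        "http://db-b.internal:8080",
        "http://db-c.internal:8080",
    ],
}

def fetch(service, path, timeout=2):
    """Try independent replicas in random order; any one of them can answer."""
    endpoints = random.sample(SERVICE_ENDPOINTS[service], k=len(SERVICE_ENDPOINTS[service]))
    last_error = None
    for base in endpoints:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except OSError as err:  # covers connection errors and timeouts; try the next replica
            last_error = err
    raise RuntimeError(f"all endpoints failed for {service}") from last_error
```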

Q8. What’s new at NuoDB?

Seth Proctor: There are too many new things to capture it all here!
For anyone who hasn’t looked at us, NuoDB is a relational database build against a fundamentally new, distributed architecture. The result is ACID semantics, support for standard SQL (joins, indexes, etc.) and a logical view of a single database (no sharding or active/passive models) designed for resiliency from the start.
Rather than talk about the details here I’d point people at a white paper (Note of the Editor: registration required) we’ve just published on the topic.
Right now we’re heavily focused on a few key challenges that our enterprise customers need to solve: migrating from vertical scale to cloud architectures, retaining consistency and availability and designing for on-demand scale and hybrid deployments. Really important is the need for global scale, where a database scales to multiple data centers and multiple geographies. That brings with it all kinds of important requirements around latencies, failure, throughput, security and residency. It’s really neat stuff.

Q9. How does it differ from other NoSQL and NewSQL databases?

Seth Proctor: The obvious difference to most NoSQL solutions is that NuoDB supports standard SQL, transactional consistency and all the other things you’d associate with an RDBMS.
Also, given our focus on enterprise use-cases, another key difference is the strong baseline of security, backup, analysis, etc. In the NewSQL space there are several databases that run in-memory, scale out and provide some kind of SQL support. Running in-memory often means placing all data in memory, however, which is expensive and can lead to single points of failure and delays on recovery. Also, there are few that really support the arbitrary SQL that enterprises need. For instance, we have customers running 12-way joins or transactions that last hours and run thousands of statements.
These kinds of general-purpose capabilities are very hard to scale on-demand but they are the requirement for getting client-server architectures into the cloud, which is why we’ve spent so long focused on a new architectural view.
One other key difference is our focus on global operations. There are very few people trying to take a single, logical database and distribute it to multiple geographies without impacting consistency, latency or security.

Qx Anything else you wish to add?

Seth Proctor: Only that this was a great set of questions, and exactly the direction I encourage everyone to think about right now. We’re in a really exciting time between public clouds, new software and amazing capacity from commodity infrastructure. The hard part is stepping back and sorting out all the ways that systems can fail.
Architecting with resiliency as a goal is going to get more commonplace as the right foundational services mature.
Asking yourself what that means, what failures you can tolerate and whether you’re building systems that can grow alongside those core services is the right place to be today. What I love about working in this space today is that concepts like resilient design, until recently a rarefied approach, are accessible to everyone.
Anyone trying to build even the simplest application today should be asking these questions and designing from the start with concepts like resiliency front and center.

———–
Seth Proctor, Chief Technology Officer, NuoDB

Seth has 15+ years of experience in the research, design and implementation of scalable systems. That experience includes work on distributed computing, networks, security, languages, operating systems and databases all of which are integral to NuoDB. His particular focus is on how to make technology scale and how to make users scale effectively with their systems.

Prior to NuoDB Seth worked at Nokia on their private cloud architecture. Before that he was at Sun Microsystems Laboratories and collaborated with several product groups and universities. His previous work includes contributions to the Java security framework, the Solaris operating system and several open source projects, in addition to the design of new distributed security and resource management systems. Seth developed new ways of looking at distributed transactions, caching, resource management and profiling in his contributions to Project Darkstar. Darkstar was a key effort at Sun which provided greater insights into how to distribute databases.

Seth holds eight patents for his cutting edge work across several technical disciplines. He has several additional patents awaiting approval related to achieving greater database efficiency and end-user agility.

Resources

– Hybrid Transaction and Analytical Processing with NuoDB

– NuoDB Larks Release 2.2 

Various Resources on NuoDB.

Related Posts

Follow ODBMS.org on Twitter: @odbmsorg

##

What are the challenges for modern Data Centers? Interview with David Gorbet. http://www.odbms.org/blog/2014/03/data-centers-challenges-interview-david-gorbet/ http://www.odbms.org/blog/2014/03/data-centers-challenges-interview-david-gorbet/#comments Tue, 25 Mar 2014 07:51:55 +0000 http://www.odbms.org/blog/?p=3127

“The real problem here is the word “silo.” To answer today’s data challenges requires a holistic approach. Your storage, network and compute need to work together.”–David Gorbet.

What are the challenges for modern data centers? On this topic I have interviewed David Gorbet, Vice President of Engineering at MarkLogic.
RVZ

Q1. Data centers are evolving to meet the demands and complexities imposed by increasing business requirements. What are the main challenges?

David Gorbet: The biggest business challenge is the rapid pace of change in both available data and business requirements. It’s no longer acceptable to spend years designing a data application, or the infrastructure to run it. You have to be able to iterate on functionality quickly. This means that your applications need to be developed in a much more agile manner, but you also need to be able to reallocate your infrastructure dynamically to the most pressing needs. In the era of Big Data this problem is exacerbated. The increasing volume and complexity of data under management is stressing both existing technologies and IT budgets. It’s not just a matter of scale, although traditional “scale-up” technologies do become very expensive as data volumes grow. It’s also a matter of complexity of data. Today a lot of data has a mix of structured and unstructured components, and the traditional solution to this problem is to split the structured components into an RDBMS, and use a search technology for the unstructured components. This creates additional complexity in the infrastructure, as different technology stacks are required for what really should be components of the same data.

Traditional technologies for data management are not agile. You have to spend an inordinate amount of time designing schemas and planning indexing strategies, both of which require pretty much full knowledge of the data and query patterns you need to provide the application's value. This has to be done before you can even load data. On the infrastructure side, even if you've embraced cloud technologies like virtualization, it's unlikely you're able to make good use of them at the data layer. Most database technologies are not architected to allow elastic expansion or contraction of capacity or compute power, which makes it hard to achieve many of the benefits (and cost savings) of cloud technologies.

To solve these problems you need to start thinking differently about your data center strategy. You need to be thinking about a data-centered data center, versus today’s more application-centered model.

Q2. You talked about a “data-centered” data center? What is it? and what is the difference with respect to a classical Data-Warehouse?

David Gorbet: To understand what I mean by “data-centered” data center, you have to think about the alternative, which is more application-centered. Today, if you have data that’s useful, you build an application or put it in a data warehouse to unlock its value. These are database applications, so you need to build out a database to power them. This database needs a schema, and that schema is optimized for the application. To build this schema, you need to understand both the data you’ll be using, and the queries that the application requires.
So you have to know in advance everything the application is going to do before you can build anything. What’s more, you then have to ETL this data from wherever it lives into the application-specific database.

Now, if you want another application, you have to do the same thing. Pretty soon, you have hundreds of data stores with data duplicated all over the place. Actually, it’s not really duplicated, it’s data derived from other data, because as you ETL the data you change its form losing some of the context and combining what’s left with bits of data from other sources. That’s even worse than straight-up duplication because provenance is seldom retained through this process, so it’s really hard to tell where data came from and trace it back to its source. Now imagine that you have to correct some data.
Can you be sure that the correction flowed through to every downstream system? Or what if you have to delete data due to a privacy issue, or change security permissions on data? Even with “small data” this is complicated, but it’s much harder and costlier with high volumes of data.

A “data-centered” data center is one that is focused on the data, its use, and its governance through its lifecycle as the primary consideration. It’s architected to allow a single management and governance model, and to bring the applications to the data, rather than copying data to the applications. With the right technologies, you can build a data-centered data center that minimizes all the data duplication, gives you consistent data governance, enables flexibility both in application development over the data and in scaling up and down capacity to match demand, allowing you to manage your data securely and cost-effectively throughout its lifecycle.

Q3. Data center resources are typically stored in three silos: compute, storage and network: is this a problem?

David Gorbet: It depends on your technology choices. Some data management technologies require direct-attached storage (DAS), so obviously you can’t manage storage separately with that kind of technology. Others can make use of either DAS or shared storage like SAN or NAS.
With the right technology, it’s not necessarily a problem to have storage managed independently from compute.
The real problem here is the word “silo.” To answer today’s data challenges requires a holistic approach. Your storage, network and compute need to work together.

Your question could also apply to application architectures. Traditionally, applications are built in a three-tiered architecture, with a DBMS for data management, an application server for business logic, and a front-end client where the UI lives. There are very good reasons for this architecture, and I believe it’s likely to be the predominant model for years to come. But even though business logic is supposed to reside in the app server, every enterprise DBMS supports stored procedures, and these are commonly used to leverage compute power near the data for cases where it would be too slow and inefficient to move data to the middle tier. Increasingly, enterprise DBMSes also have sophisticated built-in functions (and in many cases user-defined functions) to make it easy to do things that are most efficiently done right where the data lives. Analytic aggregate calculations are a good example of this. Compute doesn’t just reside in the middle tier.

This is nothing new, so why am I bringing it up? Because as data volumes grow larger, the problem of moving data out of the DBMS to do something with it is going to get a lot worse. Consider for example the problem faced by the National Cancer Institute. The current model for institutions wanting to do research based on genomic data is to download a data set and analyze it. But by the end of 2014, the Cancer Genome Atlas is expected to have grown from less than 500 TB to 2.5 PB. Just downloading 2.5 PB, even over a 10-gigabit network, would take almost a month.
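That "almost a month" figure checks out with a quick back-of-the-envelope calculation (a sketch assuming the link sustains its full rated 10 Gbit/s, which real transfers rarely do):

```python
# Back-of-the-envelope check of the transfer time for 2.5 PB over a 10 Gbit/s link.
data_bits = 2.5e15 * 8      # 2.5 PB expressed in bits
link_bps = 10e9             # 10 gigabits per second
seconds = data_bits / link_bps
print(f"{seconds / 86400:.1f} days")  # roughly 23 days, i.e. almost a month
```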

The solution? Bring more compute to the data. The implication? Twofold: First, methods for narrowing down data sets prior to acting on them are critical. This is why search technology is fast becoming a key feature of a next-generation DBMS. Search is the query language for unstructured data, and if you have complex data with a mix of structured and unstructured components, you need to be able to mix search and query seamlessly. Second, DBMS technologies need to become much more powerful so that they can execute sophisticated programs and computations efficiently where the data lives, scoped in real-time to a search that can narrow the input set down significantly. That’s the only way this stuff is going to get fast enough to happen in real-time. Another way of putting this is that the “M” in DBMS is going to increase in scope. It’s not enough just to store and retrieve data. Modern DBMS technology needs to be able to do complex, useful computations on it as well.

Q4. How do you build such a “data-centered” data center?

David Gorbet: First you need to change your mindset. Think about the data as the center of everything. Think about managing your data in one place, and bringing the application to the data by exposing data services off your DBMS. The implications for how you architect your systems are significant. Think service-oriented architectures and continuous deployment models.

Next, you need the right technology stack. One that can provide application functionality for transactions, search and discovery, analytics, and batch computation with a single governance and scale model. You need a storage system that gives great SLAs on high-value data and great TCO on lower-value data, without ETL. You need the ability to expand and contract compute power to serve the application needs in real time without downtime, and to run this infrastructure on premises or in the cloud.

You need the ability to manage data throughout its lifecycle, to take it offline for cost savings while leaving it available for batch analytics, and to bring it back online for real-time search, discovery or analytics within minutes if necessary. To power applications, you need the ability to create powerful, performant and secure data services and expose them right from where the data lives, providing the data in the format needed by your application on the fly.
We call this “schema on read.”
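A minimal illustration of the schema-on-read idea in plain Python, not MarkLogic's actual API; the document fields and the two views are made up. Documents are stored as-is, and each consumer applies its own projection at read time:

```python
import json

# Documents are stored as-is; no table schema is declared up front.
raw_store = [
    json.dumps({"id": 1, "type": "trade", "instrument": "IRS-10Y", "notional": 5000000}),
    json.dumps({"id": 2, "type": "note", "text": "customer called about settlement"}),
]

def read_as(record, fields):
    """Apply a 'schema' at read time: keep only the fields this consumer cares about."""
    doc = json.loads(record)
    return {field: doc.get(field) for field in fields}

# Two different views over the same stored documents, defined at read time.
risk_view = [read_as(r, ["id", "instrument", "notional"]) for r in raw_store]
audit_view = [read_as(r, ["id", "type", "text"]) for r in raw_store]
```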

Of course all this has to be enterprise class, with high availability, disaster recovery, security, and all the enterprise functionality your data deserves, and it has to fit in your shrinking IT budget. Sounds impossible, but the technology exists today to make this happen.

Q5. For what kind of mission critical apps is such a “data-centered” data center useful?

David Gorbet: If you have a specific application that uses specific data, and you won't need to incorporate new data sources into that application or use that data for another application, then you don't need a data-centered data center. Unfortunately, I'm having a hard time thinking of such an application. Even dull line-of-business apps don't stand alone anymore. The data they create and maintain is sent to a data warehouse for analysis.
The new mindset is that all data is potentially valuable, and that isn’t just restricted to data created in-house.
More and more data comes from outside the organization, whether in the form of reference data, social media, linked data, sensor data, log data… the list is endless.

A data-centered data center strategy isn’t about a specific application or application type. It’s about the way you have to think about your data in this new era.

Q6. How Hadoop fits into this “data-centered” data center?

David Gorbet: Hadoop is a key enabling technology for the data-centered data center. HDFS is a great file system for storing loads of data cheaply.
I think of it as the new shared storage infrastructure for “big data.” Now HDFS isn’t fast, so if you need speed, you may need NAS, SAN, or even DAS or SSD. But if you have a lot of data, it’s going to be much cheaper to store it in HDFS than in traditional data center storage technologies. Hadoop MapReduce is a great technology for batch analytics. If you want to comb through a lot of data and do some pretty sophisticated stuff to it, this is a good way to do it. The downside to MapReduce is that it’s for batch jobs. It’s not real-time.
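For readers new to the batch model, here is a toy, single-process illustration of the MapReduce shape in Python; it uses no Hadoop API and the log lines are invented. A real Hadoop job has the same map, shuffle, and reduce structure but runs distributed over HDFS blocks:

```python
from collections import defaultdict

log_lines = [
    "2014-03-01 GET /home 200",
    "2014-03-01 GET /search 500",
    "2014-03-02 GET /home 200",
]

def map_phase(line):
    """Emit (status_code, 1) for each log line."""
    yield line.split()[-1], 1

def reduce_phase(key, values):
    """Aggregate all the counts emitted for one key."""
    return key, sum(values)

# The 'shuffle': group mapper output by key, which is what the framework does between phases.
groups = defaultdict(list)
for line in log_lines:
    for key, value in map_phase(line):
        groups[key].append(value)

print(dict(reduce_phase(k, v) for k, v in groups.items()))  # {'200': 2, '500': 1}
```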

So Hadoop is an enabling technology for a data-centered data center, but it needs to be complemented with high-performance storage technologies for data that needs this kind of SLA, and more powerful analytic technologies for real-time search, discovery and analysis. Hadoop is not a DBMS, so you also need a DBMS with Hadoop to manage transactions, security, real-time query, etc.

Q7. What are the main challenges when designing an ETL strategy?

David Gorbet: ETL is hard to get right, but the biggest challenge is maintaining it. Every app has a v2, and usually this means new queries that require new data that needs a new schema and revised ETL. ETL also just fundamentally adds complexity to a solution.
It adds latency since many ETL jobs are designed to run in batches. It’s hard to track provenance of data through ETL, and it’s hard to apply data security and lifecycle management rules through ETL. This isn’t the fault of ETL or ETL tools.
It’s just that the model is fundamentally complex.

Q8. With Big Data analytics you don’t know in advance what data you’re going to need (or get in the future). What is the solution to this problem?

David Gorbet: This is a big problem for relational technologies, where you need to design a schema that can fit all your data up front.
The best approach here is to use a technology that does not require a predefined schema, and that allows you to store different entities with different schemas (or no schema) together in the same database and analyze them together.
A document database, which is a type of NoSQL database, is great for this, but be careful which one you choose because some NoSQL databases don’t do transactions and some don’t have the indexing capability you need to search and query the data effectively.
Another trend is to use Semantic Web technology. This involves modeling data as triples, which represent assertions with a subject, a predicate, and an object.
Like “This derivative (subject) is based on (predicate) this underlying instrument (object).”
It turns out you can model pretty much any data that way, and you can invent new relationships (predicates) on the fly as you need them.
No schema required. It’s also easy to relate data entities together, since triples are ideal for modeling relationships. The challenge with this approach is that there’s still quite a bit of thought required to figure out the best way to represent your data as triples. To really make it work, you need to define rules about what predicates you’re going to allow and what they mean so that data is modeled consistently.
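Continuing the derivative example above, a toy sketch of triples in Python with no particular triple store assumed; the identifiers and predicates are invented. Facts are plain (subject, predicate, object) tuples, and a new kind of relationship is just a new predicate value:

```python
# Facts as (subject, predicate, object) triples; no schema is declared anywhere.
triples = [
    ("derivative:D1", "is_based_on", "instrument:US10Y"),
    ("derivative:D1", "traded_by", "desk:rates"),
    ("instrument:US10Y", "issued_by", "issuer:UST"),
]

def objects(subject, predicate):
    """Return every object linked to a subject by a given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("derivative:D1", "is_based_on"))  # ['instrument:US10Y']

# Adding a new relationship type later requires no schema change at all.
triples.append(("derivative:D1", "cleared_through", "ccp:LCH"))
```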

Q9. What is the cost to analyze a terabyte of data?

David Gorbet: That depends on what technologies you’re using, and what SLAs are required on that data.
If you're ingesting new data as you analyze, and you need to feed some of the results of the analysis back to the data in real time (for example, if you're analyzing risk on derivatives trades before confirming them and executing business rules based on that), then you need fast disk, a fair amount of compute power, replicas of your data for HA failover, and additional replicas for DR. Including compute, this could cost you about $25,000/TB.
If your data is read-only and your analysis does not require high availability, for example a compliance application to search those aforementioned derivatives transactions, you can probably use cheaper, more tightly packed storage and less powerful compute, and get by with about $4,000/TB. If you're doing mostly batch analytics and can use HDFS as your storage, you can do this for as low as $1,500/TB.

This wide disparity in prices is exactly why you need a technology stack that can provide real-time capability for data that needs it, but can also provide great TCO for the data that doesn’t. There aren’t many technologies that can work across all these data tiers, which is why so many organizations have to ETL their data out of their transactional system to an analytic or archive system to get the cost savings they need. The best solution is to have a technology that can work across all these storage tiers and can manage migration of data through its lifecycle across these tiers seamlessly.
Again, this is achievable today with the right technology choices.

———————————-
David Gorbet, Vice President, Engineering, MarkLogic.
David has two decades of experience bringing to market some of the highest-volume applications and enterprise software in the world. David has shipped dozens of releases of business and consumer applications, server products and services ranging from open source to large-scale online services for businesses, and twice has helped start and grow billion-dollar software products.

Prior to MarkLogic, David helped pioneer Microsoft’s business online services strategy by founding and leading the SharePoint Online team. In addition to SharePoint Online, David has held a number of positions at Microsoft and elsewhere with a number of products, including Microsoft Office, Extricity B2Bi server software, and numerous incubation products.

David holds a Bachelor of Applied Science degree in Systems Design Engineering with an additional major in Psychology from the University of Waterloo, and an MBA from the University of Washington Foster School of Business.

Related Posts

On Linked Data. Interview with John Goodwin. ODBMS Industry Watch, September 1, 2013

On NoSQL. Interview with Rick Cattell. ODBMS Industry Watch August 19, 2013

Resources

Got Loss? Get zOVN!
Authors: Daniel Crisan, Robert Birke, Gilles Cressier, Cyriel Minkenberg and Mitch Gusat. IBM Research – Zurich Research Laboratory.
Abstract: Datacenter networking is currently dominated by two major trends. One aims toward lossless, flat layer-2 fabrics based on Converged Enhanced Ethernet or InfiniBand, with benefits in efficiency and performance.

F1: A Distributed SQL Database That Scales
Authors: Jeff Shute, Radek Vingralek, Eric Rollins, Stephan Ellner, Traian Stancescu, Bart Samwel, Mircea Oancea, John Cieslewicz, Himani Apte, Ben Handy, Kyle Littlefield, Ian Rae*. Google, Inc., *University of Wisconsin-Madison
Abstract: F1 is a distributed relational database system built at Google to support the AdWords business.

Events

David Gorbet will be speaking at MarkLogic World in San Francisco from April 7-10, 2014.

ODBMS.org on Twitter: @odbmsorg

On SQL and NoSQL. Interview with Dave Rosenthal http://www.odbms.org/blog/2014/03/dave-rosenthal/ http://www.odbms.org/blog/2014/03/dave-rosenthal/#comments Tue, 18 Mar 2014 16:04:45 +0000 http://www.odbms.org/blog/?p=2932

“Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications.”–Dave Rosenthal.

On SQL and NoSQL, I have interviewed Dave Rosenthal, CEO of FoundationDB.

RVZ

Q1. What are the suggested criteria for users when they need to trade durability for lower latency, higher throughput and write availability?

Dave Rosenthal: There is a tradeoff between commit latency and durability, especially in distributed databases. At one extreme a database client can just report success immediately (without even talking to the database server) and buffer the writes in the background. Obviously, that hides latency well, but you could lose a suffix of transactions. At the other extreme, you can replicate writes across multiple machines, fsync them on each of the machines, and only then report success to the client.

FoundationDB is optimized to provide good performance in its default setting, which is the safest end of that tradeoff.

Usually, if you want some reasonable amount of durability guarantee, you are talking about a commit latency of a small constant factor times the network latency. So, the real latency issues come with databases spanning multiple data centers. In that case FoundationDB users are able to choose whether they want durability guarantees in all data centers before commit (increasing commit latencies), which is our default setting, or whether they would like to relax durability guarantees by returning a commit when the data is fsync'd to disk in just one datacenter.
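The two ends of that spectrum can be sketched abstractly; this is not FoundationDB's actual commit path, just the shape of the tradeoff. One function acknowledges as soon as the write is buffered, the other only after every replica has fsync'd it to disk:

```python
import os

def commit_fast(buffer, record):
    """Low latency: report success once the write is buffered; a crash can lose recent commits."""
    buffer.append(record)
    return "ok"

def commit_durable(replica_paths, record):
    """High durability: append and fsync on every replica before reporting success."""
    for path in replica_paths:
        with open(path, "ab") as log:
            log.write(record + b"\n")
            log.flush()
            os.fsync(log.fileno())  # block until the bytes actually reach disk
    return "ok"
```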

All that said, in general, we think that the application is usually a more appropriate place to try to hide latency than the database.

Q2. Justin Sheehy of Basho in an interview said [1] “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”. What is your opinion on this?

Dave Rosenthal: Yes, we totally agree with Justin. Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications. In fact, true ACID transactions are way more broadly important than that. They give you the ability to build abstractions and systems that you can provide guarantees about.
As Michael Cahill says in his thesis which became the SIGMOD paper of the year: “Serializable isolation enables the development of a complex system by composing modules, after verifying that each module maintains consistency in isolation.” It’s this incredibly important ability to compose that makes a system with transactions special.

Q3. FoundationDB claim to provide full ACID transactions. How do you do that?

Dave Rosenthal: In the same basic way as many other transactional databases do. We use a few strategies that tend to work well in distributed systems, such as optimistic concurrency and MVCC. We also, of course, have had to solve some of the fundamental challenges associated with distributed systems and all of the crazy things that can happen in them. Honestly, it's not very hard to build a distributed transactional database. The hard part is making it work gracefully through failure scenarios and making it run fast.
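A toy illustration of the optimistic-concurrency idea in Python, not FoundationDB's engine; keys and version numbers here all live in one process. Each transaction remembers the version of every key it read, and commit refuses to apply the writes if any of those keys changed in the meantime:

```python
class OptimisticStore:
    """Tiny single-process sketch of optimistic concurrency control."""

    def __init__(self):
        self.data = {}       # key -> value
        self.versions = {}   # key -> version number

    def begin(self):
        return {"read_set": {}, "write_set": {}}

    def read(self, txn, key):
        txn["read_set"][key] = self.versions.get(key, 0)
        return self.data.get(key)

    def write(self, txn, key, value):
        txn["write_set"][key] = value

    def commit(self, txn):
        # Validate: abort if any key this transaction read has changed since it was read.
        for key, seen in txn["read_set"].items():
            if self.versions.get(key, 0) != seen:
                return False
        for key, value in txn["write_set"].items():
            self.data[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1
        return True

store = OptimisticStore()
txn = store.begin()
balance = store.read(txn, "balance") or 0
store.write(txn, "balance", balance + 100)
print(store.commit(txn))  # True unless a concurrent transaction touched "balance" first
```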

Q4. Is this similar to Oracle NoSQL?

Dave Rosenthal: Not really. Both Oracle NoSQL and FoundationDB provide an automatically-partitioned key-value store with fault tolerance. Both also have a concept of ordering keys (for efficient range operations) though Oracle NoSQL only provides ordering “within a Major Key set”. So, those are the similarities, but there are a bunch of other NoSQL systems with all those properties. The huge difference is that FoundationDB provides for ACID transactions over arbitrary keys and ranges, while Oracle NoSQL does not.

Q5. How would you compare your product offering with respect to NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak, and NewSQL such as NuoDB and VoltDB?

Dave Rosenthal: The most obvious response for the NoSQL data stores would be “we have ACID transactions, they don’t”, but the more important difference is in philosophy and strategy.

Each of those products expose a single data model and interface. Maybe two. We are pursuing a fundamentally different strategy.
We are building a storage substrate that can be adapted, via layers, to provide a variety of data models, APIs, and true flexibility.
We can do that because of our transactional capabilities. CouchDB, MongoDB, Cassandra and Riak all have different APIs and we talk to companies that run all of those products side-by-side. The NewSQL database players are also offering a single data model, albeit a very popular one, SQL. FoundationDB is offering an ever increasing number of data models through its “layers”, currently including several popular NoSQL data models and with SQL being the next big one to hit. Our philosophy is that you shouldn’t have to increase the complexity of your architecture by adopting a new NoSQL database each time your engineers need access to a new data model.

Q6. Cloud computing and open source: How does it relate to FoundationDB?

Dave Rosenthal: Cloud computing: FoundationDB has been designed from the beginning to run well in cloud environments that make use of large numbers of commodity machines connected through a network. Probably the most important aspect of a distributed database designed for cloud deployment is exceptional fault tolerance under very harsh and strange failure conditions – the kind of exceptionally unlikely things that can only happen when you have many machines working together with components failing unpredictably. We have put a huge amount of effort into testing FoundationDB in these grueling scenarios, and feel very confident in our ability to perform well in these types of environments. In particular, we have users running FoundationDB successfully on many different cloud providers, and we’ve seen the system keep its guarantees under real-world hardware and network failure conditions experienced by our users.

Open source: Although FoundationDB's core data storage engine is closed source, our layer ecosystem is open source. The core data storage engine has a very simple feature set and is very difficult to modify properly while maintaining correctness; layers, on the other hand, are very feature rich and, because they are stateless, are much easier to create and modify, which makes them well suited to third-party contributions.

Q7. Please give some examples of use cases where FoundationDB is currently in use. Is FoundationDB in use for analyzing Big Data as well?

Dave Rosenthal: Some examples: user data, metadata, user social graphs, geo data, access via ORMs using the SQL layer, metrics collection, etc.

We’ve mostly focused on operational systems, but a few of our customers have built what I would call “big data” applications, which I think of as analytics-focused. The most common use case has been for collecting and analyzing time-series data. FoundationDB is strongest in big data applications that call for lots of random reads and writes, not just big table scans—which many systems can do well.

Q8. Rick Cattell said in a recent interview [2] "there aren't enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves". What is your opinion on this?

Dave Rosenthal: People have great ideas for databases all the time. New data models, new query languages, etc.
If nothing else, this NoSQL experiment that we've all been a part of for the past few years has shown us the appetite for data models suited to specific problems. The people with these ideas would love to be able to build these tools, open source them, etc.
The problem is that the checklist of practical considerations for a database is huge: Fault tolerance, scalability, a backup solution, management and monitoring, ACID transactions, etc. Add those together and even the simplest concept sounds like a huge project.

Our vision at FoundationDB is that we have done the hard work to build a storage substrate that simultaneously solves all those tricky practical problems. Our engine can be used to quickly build a database layer for any particular application that inherits all of those solutions and their benefits, like scalability, fault tolerance and ACID compliance.

Q9. Nick Heudecker of Gartner predicts that [3] "going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time". What is your take on this?

Dave Rosenthal: I do think that the lines between SQL and NoSQL will start to blur and I believe that we are leading that charge. We acquired another database startup last year called Akiban that builds an amazing SQL database engine.
In 2014 we’ll be bringing that engine to market as a layer running on top of FoundationDB. That will be a true ANSI SQL database operating as a module directly on top of a transactional “NoSQL” engine, inheriting the operational benefits of our core storage engine – scalability, fault tolerance, ease of operation.

When you run multiple SQL layer modules, you can point many of them at the same key-space in FoundationDB and it's as if they are all part of the same database, with ACID transactions enforced across the separate SQL layer processes.
It’s very cool. Of course, you can even run the SQL layer on a FoundationDB cluster that’s also supporting other data models, like graph or document. That’s about as blurry as it gets.

———–
Dave Rosenthal is CEO of FoundationDB. Dave started his career in games, building a 3D real-time strategy game with a team of high-school friends that won the 1st annual Independent Games Festival. Previously, Dave was CTO at Visual Sciences, a pioneering web-analytics company that is now part of Adobe. Dave has a degree in theoretical computer science from MIT.

Related Posts
Operational Database Management Systems. Interview with Nick Heudecker, ODBMS Industry Watch December 16, 2013

Follow ODBMS.org on Twitter: @odbmsorg


Challenges and Opportunities for Big Data. Interview with Mike Hoskins http://www.odbms.org/blog/2013/12/challenges-and-opportunities-for-big-data-interview-with-mike-hoskins/ http://www.odbms.org/blog/2013/12/challenges-and-opportunities-for-big-data-interview-with-mike-hoskins/#comments Tue, 03 Dec 2013 07:52:01 +0000 http://www.odbms.org/blog/?p=2821

“We are facing an imminent torrent of machine generated data, creating volumes that will break the back of conventional hardware and software architectures. It is no longer feasible to move the data to the compute process – the compute process has to be moved to the data” –Mike Hoskins.

On the topic, Challenges and Opportunities for Big Data, I have interviewed Mike Hoskins, Actian Chief Technology Officer.

RVZ

Q1. What are in your opinion the most interesting opportunities in Big Data?

Mike Hoskins: Until recently, most data projects were solely focused on preparation. Seminal developments in the big data landscape, including Hortonworks Data Platform (HDP) 2.0 and the arrival of YARN (Yet Another Resource Negotiator) – which takes Hadoop's capabilities in data processing beyond the limitations of the highly regimented and restrictive MapReduce programming model – provide an opportunity to move beyond the initial hype of big data and instead towards the more high-value work of predictive analytics.
As more big data applications are built on the Hadoop platform customized by industry and business needs, we’ll really begin to see organizations leveraging predictive analytics across the enterprise – not just in a sandbox or in the domain of the data scientists, but in the hands of the business users. At that point, more immediate action can be taken on insights.

Q2. What are the most interesting challenges in Big Data?

Mike Hoskins: We are facing an imminent torrent of machine generated data, creating volumes that will break the back of conventional hardware and software architectures. It is no longer feasible to move the data to the compute process – the compute process has to be moved to the data. Companies need to rethink their static and rigid business intelligence and analytic software architectures in order to continue working at the speed of business. It's clear that time has become the new gold standard – you can't produce more of it; you can only increase the speed at which things happen.
Software vendors with the capacity to survive and thrive in this environment will keep pace with the competition by offering a unified platform, underpinned by engineering innovation, completeness of solution and the service integrity and customer support that is essential to market staying power.

Q3. Steve Shine, CEO and President, Actian Corporation, said in a recent interview (*) that “the synergies in data management come not from how the systems connect but how the data is used to derive business value”. Actian has completed a number of acquisitions this year. So, what is your strategy for Big Data at Actian?

Mike Hoskins: Actian has placed its bets on a completely modern unified platform that is designed to deliver on the opportunities presented by the Age of Data. Our technology assets bring a level of maturity and innovation to the space that no other technology vendor can provide – with 30+ years of expertise in ‘all things data’ and over $1M investment in innovation. Our mission is to arm organizations with solutions that irreversibly shift the price/performance curve beyond the reach of traditional legacy stack players, allowing them to get a leg up on the competition, retain customers, detect fraud, predict business trends and effectively use data as their most important asset.

Q4. What are the products synergies related to such a strategy?

Mike Hoskins: Through the acquisition of Pervasive Software (a provider of big data analytics and cloud-based and on-premises data management and integration), Versant (an industry leader in specialized data management), and ParAccel (a leader in high-performance analytics), Actian has compiled a unified end-to-end platform with capabilities to connect, prep, optimize and analyze data natively on Hadoop, and then offer it to the necessary reporting and analytics environments to meet virtually any business need. All the while, operating on commodity hardware at a much lower cost than legacy software can ever evolve to.

Q5. What else still need to be done at Actian to fully deploy this strategy?

Mike Hoskins: There are definitely opportunities to continue integrating the platform experience and improve the user experience overall. Our world-class database technology can be brought closer to Hadoop, and we will continue innovating on analytic techniques to grow our stack upward.
Our development team is working diligently to create a common user interface across all of our platforms, as we bring our technology together. We have the opportunity to create a true first-class SQL engine running natively on Hadoop, and to more fully exploit market-leading cooperative computing with our On-Demand Integration (ODI) capabilities. I would also like to raise the awareness of the power and speed of our offerings as a general paradigm for analytic applications.

We don’t know what new challenges the Age of Data will bring, but we will continue to look to the future and build out a technology infrastructure to help organizations deal with the only constant – change.

Q6. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

Mike Hoskins: Elastic cloud computing is a convulsive game changer in the marketplace. It's positive; if not where you do full production, at the very least it allows people to test, adopt and experiment with their data in a way that they couldn't before. For cases where data is born in the cloud, using a 100% cloud model makes sense. However, much data is highly distributed in cloud and on-premises systems and applications, so it's vital to have technology that can run in and connect to either environment via a hybrid model.

We will soon see more organizations utilizing cloud platforms to run analytic processes, if that is where their data is born and lives.

Q7. How is your Cloud technology helping Amazon`s Redshift?

Mike Hoskins: Amazon Redshift leverages our high-performance analytics database technology to help users get the most out of their cloud investment. Amazon selected our technology over all other database and data warehouse technologies available in the marketplace because of the incredible performance, extreme scalability, and flexibility.

Q8. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey.
When you speak with your customers what are the typical use cases and requirements they have?

Mike Hoskins: A recent survey of data architects and CIOs by Sand Hill Group revealed that the top challenge of Hadoop adoption was knowledge and experience with the Hadoop platform, followed by the availability of Hadoop and big data skills, and finally the amount of technology development required to implement a Hadoop-based solution. This just goes to show how little we have actually begun to fully leverage the capabilities of Hadoop. Businesses are really only just starting to dip their toe in the analytic water. Although it’s still very early, the majority of use cases that we have seen are centered around data prep and ETL.

Q9. What do you think is still needed for big data analytics to be really useful for the enterprise?

Mike Hoskins: If we look at the complete end-to-end data pipeline, there are several things that are still needed for enterprises to take advantage of the opportunities. This includes high productivity, performant integration layers, and analytics that move beyond the sphere of data science and into mainstream business usage, with discovery analytics through a simple UI studio or an analytics-as-a-service offering. Analytics need to be made more available in the critical discovery phase, to bring out the outcomes, patterns, models, discoveries, etc. and begin applying them to business processes.

Qx. Anything else you wish to add?

Mike Hoskins: These kinds of highly disruptive periods are, frankly, unnerving for the marketplace and businesses. Organizations cannot rely on traditional big stack vendors, who are unprepared for the tectonic shift caused by big data, and therefore are not agile enough to rapidly adjust their platforms to deliver on the opportunities. Organizations are forced to embark on new paths and become their own System Integrators (SIs).

On the other hand, organizations cannot tie their future to the vast number of startups, throwing darts to find the one vendor that will prevail. Instead, they need a technology partner somewhere in the middle that understands data in-and-out, and has invested completely and wholly as a dedicated stack to help solve the challenge.

Although it’s uncomfortable, it is urgent that organizations look at modern architectures, next-generation vendors and innovative technology that will allow them to succeed and stay competitive in the Age of Data.

—————————–
Mike Hoskins, Actian Chief Technology Officer
Actian CTO Michael Hoskins directs Actian’s technology innovation strategies and evangelizes accelerating trends in big data, and cloud-based and on-premises data management and integration. Mike, a Distinguished and Centennial Alumnus of Ohio’s Bowling Green State University, is a respected technology thought leader who has been featured in TechCrunch, Forbes.com, Datanami, The Register and Scobleizer. Mike has been a featured speaker at events worldwide, including Strata NY + Hadoop World 2013, the keynoter at DeployCon 2012, the “Open Standards and Cloud Computing” panel at the Annual Conference on Knowledge Discovery and Data Mining, the “Scaling the Database in the Cloud” panel at Structure 2010, and the “Many Faces of Map Reduce – Hadoop and Beyond” panel at Structure Big Data 2011. Mike received the AITP Austin chapter’s 2007 Information Technologist of the Year Award for his leadership in developing Actian DataRush, a highly parallelized framework to leverage multicore. Follow Mike on Twitter: @MikeHSays.

Related Posts

Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner. November 15, 2013

On Big Data. Interview with Adam Kocoloski. November 5, 2013

Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. September 23, 2013

(*) Acquiring Versant –Interview with Steve Shine. March 6, 2013

Resources

“Do You Hadoop? A Survey of Big Data Practitioners”, Bradley Graham M. R. Rangaswami, SandHill Group, October 29, 2013 (.PDF)

Actian Vectorwise 3.0: Fast Analytics and Answers from Hadoop. Actian Corporation
Paper | Technical | English | Download (PDF) | May 2013

On Big Data and NoSQL. Interview with Renat Khasanshyn. http://www.odbms.org/blog/2013/10/on-big-data-and-nosql-interview-with-renat-khasanshyn/ http://www.odbms.org/blog/2013/10/on-big-data-and-nosql-interview-with-renat-khasanshyn/#comments Mon, 07 Oct 2013 19:50:27 +0000 http://www.odbms.org/blog/?p=2636

“The most important thing is to focus on a task you need to solve instead of a technology” –Renat Khasanshyn.

I have interviewed Renat Khasanshyn, Founder and Chairman of Altoros.
Renat is a NoSQL and Big Data specialist.

RVZ

Q1. In your opinion, what are the most popular players in the NoSQL market?

Khasanshyn: I think MongoDB is definitely one of the key players in the NoSQL market. This database has a long history, at least for this kind of product, and good commercial support. For many people this database became the first mass-market NoSQL store. I can assume that MongoDB is going to become for NoSQL something like what MySQL is for relational databases. The second position I would give to Cassandra. It has a great architecture and enables building clusters with geographically dispersed nodes. To me that seems absolutely amazing. In addition, this database is often chosen by big companies that need a large, highly available cluster.

Q2. How do you evaluate and compare different NoSQL Databases?

Khasanshyn: Thank you for an interesting question. How to choose a database? Which one is the best? These are the main questions for any company that wants to try a NoSQL solution. Definitely, for some cases it may be quite easy to select a suitable NoSQL store. However, very often it is not enough just to know a customer's business goals. When we suggest a particular database we take into consideration the following factors: the business issues a NoSQL store should solve, database reading/writing speed, availability, scalability, and many other important indicators. Sometimes we use a hybrid solution that may include several NoSQL databases.
Or we may find that a relational database is a good match for the case. The most important thing is to focus on a task you need to solve instead of a technology.

We think that good scalability, performance, and ease of administration are the most important criteria for choosing a NoSQL database. These are the key factors that we take into consideration. Of course, there are some additional criteria that sometimes may be even more important than those mentioned above. To somehow simplify the choice of a database for our engineers and for many other people, we have been carrying out independent tests for two years that evaluate the performance of different NoSQL databases. Although aimed at comparing performance, these investigations also touch on consistency, scalability, and configuration issues. You can take a look at our most recent article on this subject on Network World. Some new research on this subject is to be published in a couple of months.

Q3. Which NoSQL databases did you evaluate so far, and what are the results did you obtain?

Khasanshyn: We have used a great variety of NoSQL databases, for instance Cassandra, HBase, MongoDB, Riak, Couchbase Server, and Redis, in our research and real-life projects. From this experience, we learned that one should be very careful when choosing a database. It is better to spend more time on architecture design and make some changes to the project in the beginning rather than come across a serious issue in the future.

Q4. For which projects did use NoSQL databases and for what kind of problems?

Khasanshyn: It is hardly possible to name a project for which a NoSQL database would be useless, except for a blog or a home page. As the main use cases for NoSQL stores I would mention the following tasks:

● collecting and analyzing large volumes of data
● scaling large historical databases
● building interactive applications for which performance and fast response time to users’ actions are crucial

The major “drawback” of NoSQL architecture is the absence of an ACID engine that provides transaction verification. It means that financial operations or user registration are better handled by an RDBMS like Oracle or MS SQL. However, the absence of ACID allows for significant acceleration and decentralization of NoSQL databases, which are their major advantages. The bottom line: non-relational databases are much faster in general, and they pay for it with a fraction of their reliability. Is it a good tradeoff? It depends on the task.

Q5. What do you see happening with NoSQL, going forward?

Khasanshyn: It’s quite difficult to make any predictions, but we guess that NoSQL and relational databases will become closer. For instance, NewSQL solutions took good scalability from NoSQL and a query language from the SQL products.
Probably, a kind of standard query language based on SQL or an SQL-like language will soon appear for NoSQL stores. We are also looking forward to improved consistency, or to be more precise, better predictability of NoSQL databases. In other words, NoSQL solutions should become more mature. We will also see some market consolidation. Smaller players will form alliances or quit the game. Leaders will take a bigger part of the market. We will most likely see a couple of acquisitions. Overall, it will be easier to work with NoSQL and to choose the right solution out of the available options.

Q6. What do you think is still needed for big data analytics to be really useful for the enterprise?

Khasanshyn: It is all about people and their skills. Storage is relatively cheap and available. The variety of databases is enormous, and it helps solve virtually any task. Hadoop is stable. The majority of software is open source-based, or at least doesn't cost a fortune. But all these components are useless without data scientists who can do modeling and simulations on the existing data, as well as engineers who can efficiently employ the toolset and managers who understand the outcomes of the data-handling revolution that happened just recently. When we have these three types of people in place, then we will be able to say that enterprises are fully equipped to gain an edge in big data analytics.

Q7. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey. When you speak with your customers what are the typical use cases and requirements they have?

Khasanshyn: I agree with you. Some customers are just taking their first steps with Hadoop, while others need to know how to optimize their Hadoop-based systems. Unfortunately, the second group of customers is much smaller. I can name the following typical tasks our customers have:

● To process historical data that has been collected over a long period of time. Some time ago, users were unable to process large volumes of unstructured data due to financial and technical limitations. Now Hadoop can do it at a moderate cost and in a reasonable time.

● To develop a system for data analysis based on Hadoop. Once an hour, the system builds patterns of typical user behavior on a website. These patterns help to react to users' actions in real time, for instance allowing an action or temporarily blocking it because it is not typical of this user. The data is collected continuously and analyzed at the same time, so the system can rapidly respond to changes in user behavior.

● To optimize data storage. It is interesting that in some cases HDFS can replace a database, especially when the database was used for storing raw input data. Such projects do not need an additional database level.

I should say that our customers have similar requirements. Apart from solving a particular business task, they need a certain level of performance and data consistency.

Q8. In your opinion is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?

Khasanshyn: In a few words, my answer is yes. Some specific features of Hadoop make it possible to prepare data for future analysis quickly and at a moderate cost. In addition, this technology can work with unstructured data. However, I do not think it will happen very soon. There are many OLAP systems and they solve their tasks, better or worse. In general, enterprises are very reluctant to change anything. In addition, replacing the OLAP tools requires additional investment. The good news is that we don't have to choose one or the other. Hadoop can be used as a pre-processor of the data for OLAP analysis, and analysts can keep working with the familiar tools.

Q9. How do you categorize the various stages of the Hadoop usage in the enterprises?

Khasanshyn: I would name the following stages of Hadoop usage:

1. Development of prototypes to check out whether Hadoop is suitable for their tasks
2. Using Hadoop in combination with other tools for storing and processing data of some business units
3. Implementation of a centralized enterprise data storage system and gradual integration of all business units into it

Q10. Data Warehouse vs Big “Data Lake”. What are the similarities and what are the differences?

Khasanshyn: Even though Big “Data Lake” is a metaphor, I do not really like it.
This name suggests that it is something limited, isolated from other systems. I would rather call this concept a “Big Data Ocean”, because the data added to the system can interact with the surrounding systems. In my opinion, data warehouses are a bit outdated. At an earlier stage, such systems enabled companies to aggregate a lot of data in one central store and organize it, and all this was done at acceptable speed. Now there are a lot of cheap storage solutions, so we can keep enormous volumes of data and process them much faster than with data warehouses.

The ability to store large volumes of data is a common feature of data warehouses and a Big “Data Lake”. With a flexible architecture and broad capabilities for data analysis and discovery, a Big “Data Lake” provides a wider range of business opportunities. A modern company should adjust to changes very fast. The structure that was good yesterday may become a limitation today.

Q11. In your opinion, is there a technology which is best suited to build a Big Data Analytics Data Platform? If yes, which one?

Khasanshyn: As I have already said, there is no “magic bullet” that can cure every disease. There is no one-size-fits-all “Universal Big Data Analytics Data Platform”; everything depends on the requirements of a particular business. A system of this kind should have the following features (a minimal sketch of how these pieces might be wired together follows the list):

● A Hadoop-based system for storing and processing data
● An operational database that contains the most recent data to be analyzed, which can be raw data.
A NoSQL solution can be used in this case.
● A database for storing predicted indicators. Most probably, it should be a relational data store.
● A system that allows for creating data analysis algorithms. The R language can be used to complete this task.
● A report building system that provides access to data. For instance, there such good options like Tableau or Pentaho.
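The sketch below is one way to picture how these pieces fit together, under loose stand-in assumptions: a plain Python dictionary stands in for the NoSQL operational store, a trivial moving-average function stands in for the analysis algorithm (R in the list above), and SQLite stands in for the relational store of predicted indicators. None of the names refer to an actual product configuration.

import sqlite3
from statistics import mean

# Stand-in for the NoSQL operational store holding recent raw events.
operational_store = {
    "site_visits": [120, 135, 150, 160, 155],  # e.g. hourly counts
}

# Stand-in for the data analysis step: a trivial moving-average forecast.
def predict_next(values, window=3):
    return mean(values[-window:])

# Relational store for predicted indicators, as suggested in the list above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE predictions (metric TEXT, predicted_value REAL)")

for metric, values in operational_store.items():
    db.execute("INSERT INTO predictions VALUES (?, ?)", (metric, predict_next(values)))

# A reporting tool (Tableau, Pentaho, ...) would read from this table.
print(db.execute("SELECT * FROM predictions").fetchall())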

Q12. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

Khasanshyn: In my opinion, cloud computing was the force that raised the Big Data wave. Elastic computing lets us use exactly the amount of compute resources we need and reduces the cost of maintaining a large infrastructure.

There is a close connection between elastic computing and Big Data analytics. A typical case for us is having to process data periodically, for instance once a day. To handle such a task, we can deploy a new cluster or scale up an existing Hadoop cluster in the cloud, temporarily increasing the speed of data processing. The job finishes faster, and afterwards we can stop the Hadoop cluster or reduce its size. I would even say that cloud technologies are a must-have component of a Big Data analysis system.
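As a rough illustration of that scale-up-then-shrink pattern, the sketch below uses boto3 (the current AWS SDK for Python, not necessarily what was available at the time of this interview) to resize the core instance group of an existing Elastic MapReduce cluster around a daily job. The cluster ID and instance group ID are placeholders, and the node counts are arbitrary.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

CLUSTER_ID = "j-XXXXXXXXXXXXX"          # placeholder cluster ID
CORE_GROUP_ID = "ig-XXXXXXXXXXXXX"      # placeholder instance group ID

def resize_core_group(count: int) -> None:
    """Ask EMR to scale the core instance group to `count` nodes."""
    emr.modify_instance_groups(
        ClusterId=CLUSTER_ID,
        InstanceGroups=[{"InstanceGroupId": CORE_GROUP_ID, "InstanceCount": count}],
    )

resize_core_group(20)   # scale up before the daily batch job
# ... submit the Hadoop job and wait for it to finish here ...
resize_core_group(4)    # shrink back once the job is done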

——-

Renat Khasanshyn, Founder and Chairman, Altoros.
Renat is founder & CEO of Altoros, and Venture Partner at Runa Capital. Renat helps define Altoros’s strategic vision and its role in the Big Data, Cloud Computing, and PaaS ecosystem. Renat is a frequent conference and meetup speaker on these topics.
Under his supervision Altoros has been servicing such innovative companies as Couchbase, RightScale, Canonical, DataStax, Joyent, Nephoscale, and NuoDB.
In the past, Renat has been selected as finalist for the Emerging Executive of the Year award by the Massachusetts Technology Leadership Council and once won an IBM Business Mashup Challenge. Prior to founding Altoros, Renat was VP of Engineering for Tampa-based insurance company PriMed. Renat is also founder of Apatar, an open source data integration toolset, founder of Silicon Valley NewSQL User Group and co-founder of the Belarusian Java User Group.

——————–
Related Posts

On NoSQL. Interview with Rick Cattell. August 19, 2013

On Oracle NoSQL Database –Interview with Dave Segleau. July 2, 2013

On Big Data and Hadoop. Interview with Paul C. Zikopoulos. June 10, 2013
————————-

Resources

Evaluating NoSQL Performance, Sergey Bushik, Altoros (Slideshare)

“A Vendor-independent Comparison of NoSQL Databases: Cassandra, HBase, MongoDB, Riak”. Altoros (registration required)

ODBMS.org free resources on Big Data and Analytical Data Platforms:
Blog Posts | Free Software| Articles | Lecture Notes | PhD and Master Thesis|

ODBMS.org free resources on NoSQL Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis

ODBMS.org free resources on Cloud Data Stores:
Blog Posts | Lecture Notes| Articles and Presentations| PhD and Master Thesis|

———————–

Follow ODBMS.org on Twitter: @odbmsorg

##

Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. http://www.odbms.org/blog/2013/09/data-analytics-at-nbcuniversal-interview-with-matthew-eric-bassett/ http://www.odbms.org/blog/2013/09/data-analytics-at-nbcuniversal-interview-with-matthew-eric-bassett/#comments Mon, 23 Sep 2013 14:48:10 +0000 http://www.odbms.org/blog/?p=2639

“The most valuable thing I’ve learned in this role is that judicious use of a little bit of knowledge can go a long way. I’ve seen colleagues and other companies get caught up in the “Big Data” craze, spending hundreds of thousands of pounds sterling on a Hadoop cluster that sees a few megabytes a month. But the most successful initiatives I’ve seen treat it as another tool and keep an eye out for valuable problems that they can solve.” –Matthew Eric Bassett.

I have interviewed Matthew Eric Bassett, Director of Data Science for NBCUniversal International.
NBCUniversal is one of the world’s leading media and entertainment companies in the development, production, and marketing of entertainment, news, and information to a global audience.
RVZ

Q1. What is your current activity at Universal?

Bassett: I’m the Director of Data Science for NBCUniversal International. I lead a small but highly effective predictive analytics team. I’m also a “data evangelist”; I spend quite a bit of my time helping other business units realize they can find business value from sharing and analyzing their data sources.

Q2. Do you use Data Analytics at Universal and for what?

Bassett: We predict key metrics for the different businesses – everything from television ratings, to how an audience will respond to marketing campaigns, to the value of a particular opening weekend for the box office. To do this, we use machine learning regression and classification algorithms, semantic analysis, Monte Carlo methods, and simulations.

Q3. Do you have Big Data at Universal? Could you please give us some examples of Big Data Use Cases at Universal?

Bassett: We’re not working with terabyte-scale data sources. “Big data” for us often means messy or incomplete data.
For instance, our cinema distribution company operates in dozens of countries. For each day in each one, we need to know how much money was spent and by whom – and feed this information into our machine-learning simulations for future predictions.
Each country might have dozens more cinema operators, all sending data in different formats and of different quality. One territory may neglect demographics, another might mis-report gross revenue. In order for us to use the data, we have to find missing or incorrect values and set the appropriate flags in our models and reports for later.

Automating this process is the bulk of our Big Data operation.
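A toy Python sketch of that kind of automated validation and flagging might look like the following. The field names, rules, and sample records are invented for illustration; they are not NBCUniversal's actual schema or pipeline.

# Hypothetical daily records from cinema operators.
records = [
    {"country": "FR", "operator": "op-17", "gross": 125000.0, "admissions": 9800},
    {"country": "DE", "operator": "op-03", "gross": None, "admissions": 4100},
    {"country": "IT", "operator": "op-22", "gross": -500.0, "admissions": 0},
]

def validate(record):
    """Return a list of flags describing missing or suspicious fields."""
    flags = []
    if record["gross"] is None:
        flags.append("missing_gross")
    elif record["gross"] < 0:
        flags.append("negative_gross")
    if record["admissions"] == 0 and (record["gross"] or 0) > 0:
        flags.append("admissions_gross_mismatch")
    return flags

for rec in records:
    rec["flags"] = validate(rec)  # downstream models can ignore or down-weight flagged rows

print([(r["operator"], r["flags"]) for r in records])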

Q4. What “value” can be derived by analyzing Big Data at Universal?

Bassett: “Big data” helps everything from marketing, to distribution, to planning.
“In marketing, we know we’re wasting half our money. The problem is that we don’t know which half.” Big data is helping us solve that age-old marketing problem.
We’re able to track how the market responds to our advertising campaigns over time, compare it to past campaigns and products, and use that information to reach our audience more precisely (a bit like how the Obama campaign used big data to optimize its strategy).

In cinema alone, the opening weekend of a film can affect gross revenue by seven figures (or more), so any insight we can provide into the optimal release timing can directly generate thousands or millions of dollars in revenue.

Being able to distill “big data” from historical information, audience responses in social media, data from commercial operators, et cetera, into a usable and interactive simulation completely changes how we plan our strategy for the next 6-15 months.

Q5. What are the main challenges for big data analytics at Universal ?

Bassett: Internationalization, adoption, and speed.
We’re an international operation, so we need to extend our results from one country to another.
Some territories show a high correlation between our data mining operation and the metrics we want to predict. But when we extend to other territories we run into several issues.
For instance, 1) it’s not as easy for us to do data mining on unstructured linguistic data (like audience comments on a YouTube preview) and 2) user-generated and web analytics data is harder to find (and in some cases nonexistent!) in some of our markets, even if we did have a multi-language data mining capability. Less reliable regions send us incoming data or historicals that are erroneous, incomplete, or simply not there – see my comment about “messy data”.

Reliability with internationalization feeds into another issue – we’re in an industry that historically uses qualitative rather than quantitative processes. It takes quite a bit of “evangelism” to convince people what is possible with a bit of statistics and programming, and even after we’ve created a tool for a business, it takes some time for all the key players to trust and use it consistently.

A big part of accomplishing that is ensuring that our simulations and predictions happen fast.
Naturally, our systems need to be able to respond to market changes (a competing film studio changes a release date, an event in the news changes television ratings, et cetera) and inform people what happens.
But we need to give researchers and industry analysts feedback instantly – even while the underlying market is static – to keep them engaged. We’re often asking ourselves questions like “how can we make this report faster” or “how can we speed up this script that pulls audience info from a PDF”.

Q6. How do you handle the Big Data Analytics “process” challenges with deriving insight?
For example when:

  • capturing data
  • aligning data from different sources (e.g., resolving when two objects are the same)
  • transforming the data into a form suitable for analysis
  • modeling it, whether mathematically, or through some form of simulation
  • understanding the output
  • visualizing and sharing the results

Bassett: We start with the insight in mind: What blind-spots do our businesses have, what questions are they trying to answer and how should that answer be presented? Our process begins with the key business leaders and figuring out what problems they have – often when they don’t yet know there’s a problem.

Then we start our feature selection, and identify which sources of data will help achieve our end goal – sometimes a different business unit has it sitting in a silo and we need to convince them to share, sometimes we have to build a system to crawl the web to find and collect it.
Once we have some idea of what we want, we start brainstorming about the right methods and algorithms to use to reveal useful information: Should we cluster across a multi-variate time series of market response per demographic and use that as an input for a regression model? Can we reliably get a quantitative measure of a demographic’s engagement from sentiment analysis on comments? This is an iterative process, and we spend quite a bit of time in the “capturing data/transforming the data” step.
But that’s where all the fun is, and it’s not as hard as it sounds: typically, the most basic scientific methods are sufficient to capture 90% of the business value, so long as you can figure out when and where to apply them and where the edge cases lie.
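As a rough, hypothetical illustration of the cluster-then-regress idea mentioned above, the sketch below groups synthetic per-demographic response series with scikit-learn's KMeans and feeds the cluster label into a simple regression. The data, feature choices, and target are all invented; this is not NBCUniversal's model.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic weekly market-response series, one row per demographic segment.
responses = rng.random((20, 8))
# Synthetic target, e.g. opening-weekend revenue per segment (arbitrary units).
revenue = responses.sum(axis=1) * 10 + rng.normal(0, 1, size=20)

# Step 1: cluster the response curves into a handful of behavior types.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(responses)

# Step 2: use the cluster label (plus a summary feature) as regression inputs.
X = np.column_stack([labels, responses.mean(axis=1)])
model = LinearRegression().fit(X, revenue)

print("R^2 on the training data:", model.score(X, revenue))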

Finally, there is another exciting stage: finding surprising insights in the results.
You might start by trying to get a metric for risk in cinema, and in the process find a metric for how the risk changes for releases that target a specific audience – and this new method might work for a different business.

Q7. What kind of data management technologies do you use? What is your experience in using them? Do you handle un-structured data? If yes, how?

Bassett: For our structured, relational data, we make heavy use of MySQL. Despite collecting and analyzing a great deal of un-structured data, we haven’t invested much in a NoSQL or related infrastructure. Rather, we store and organize such data as raw files on Amazon’s S3 – it might be dirty, but we can easily mount and inspect file systems, use our Bash kung-fu, and pass S3 buckets to Hadoop/Elastic MapReduce.

Q8. Do you use Hadoop? If yes, what is your experience with Hadoop so far?

Bassett: Yes, we sometimes use Hadoop for that “learning step” I described earlier, as well as for batch data-mining jobs on collected information. However, our experience is limited to Amazon’s Elastic MapReduce, which makes the whole process quite simple – we literally write our map and reduce procedures (in whatever language we choose), tell Amazon where to find the code and the data, and grab some coffee while we wait for the results.
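For readers unfamiliar with that workflow, here is a minimal word-count-style mapper and reducer of the kind one can hand to Elastic MapReduce's streaming mode. This is a generic Hadoop Streaming sketch, not NBCUniversal's code; the file name and invocation in the comment are illustrative.

#!/usr/bin/env python
"""Minimal Hadoop Streaming pair: e.g. run with -mapper 'python wc.py map'
and -reducer 'python wc.py reduce' (file name and invocation are illustrative)."""
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    # Hadoop Streaming delivers mapper output sorted by key.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()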

Q9. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?

Bassett: We don’t do any real-time analytics…yet. Thus far, we’ve created a lot of value from simulations that respond to changing market information.

Q10 Cloud computing and open source: Do they play a role at Universal? If yes, how?

Bassett: Yes, cloud computing and open source play a major role in all our projects: our whole operation makes extensive use of Amazon’s EC2 and Elastic MapReduce for simulation and data mining, and S3 for data storage.

We’re big believers in functional programming – many projects start with “experimental programming” in Racket (a dialect of the Lisp programming language) and often stay there into production.

Additionally, we take advantage of the thriving Python community for computational statistics: IPython Notebook, NumPy, SciPy, NLTK, et cetera.

Q11 What are the main research challenges ahead? And what are the main business challenges ahead?

Bassett: I alluded to some already previously: collecting and analyzing multi-lingual data, promoting the use of predictive analytics, and making things fast.

Recruiting top talent is frequently a topic of discussion among my colleagues, but we’ve been quite fortunate in this regard. (And we devote a great deal of time to training in machine learning and big data methods.)

Qx Anything else you wish to add?

Bassett: The most valuable thing I’ve learned in this role is that judicious use of a little bit of knowledge can go a long way. I’ve seen colleagues and other companies get caught up in the “Big Data” craze, spending hundreds of thousands of pounds sterling on a Hadoop cluster that sees a few megabytes a month. But the most successful initiatives I’ve seen treat it as another tool and keep an eye out for valuable problems that they can solve.

Thanks!

—–

Matthew Eric Bassett -Director of Data Science, NBCUniversal International
Matthew Eric Bassett is a programmer and mathematician from Colorado and started his career there building web and database applications for public and non-profit clients. He moved to London in 2007 and worked as a consultant for startups and small businesses. In 2011, he joined Universal Pictures to work on a system to quantify risk in the international box office market, which led to his current position leading a predictive analytics “restructuring” of NBCUniversal International.
Matthew holds an MSci in Mathematics and Theoretical Physics from UCL and is currently pursuing a PhD in Noncommutative Geometry from Queen Mary, University of London, where he is discovering interesting, if useless, applications of his field to number theory and machine learning.

Resources

How Did Big Data Help Obama Campaign? (Video Bloomberg TV)

Google’s Eric Schmidt Invests in Obama’s Big Data Brains (Bloomberg Businessweek Technology)

Cloud Data Stores – Lecture Notes: “Data Management in the Cloud”. Michael Grossniklaus, David Maier, Portland State University.
Lecture Notes | Intermediate/Advanced | English | DOWNLOAD ~280 slides (PDF)| 2011-12|

Related Posts

Big Data from Space: the “Herschel” telescope. August 2, 2013

Cloud based hotel management– Interview with Keith Gruen July 25, 2013

On Big Data and Hadoop. Interview with Paul C. Zikopoulos. June 10, 2013

Follow ODBMS.org on Twitter: @odbmsorg

##

Isis2: A new Open Platform for Data Replication in the Cloud.–Interview with Ken Birman. http://www.odbms.org/blog/2013/07/isis2-a-new-open-platform-for-data-replication-in-the-cloud-interview-with-ken-birman/ http://www.odbms.org/blog/2013/07/isis2-a-new-open-platform-for-data-replication-in-the-cloud-interview-with-ken-birman/#comments Thu, 11 Jul 2013 06:31:18 +0000 http://www.odbms.org/blog/?p=2359

“My target is to be the MapReduce solution for the world’s realtime problems.”— Ken Birman.

I had the pleasure to interview Ken Birman, Professor of Computer Science at Cornell University. Ken is a pioneer and one of the leading researchers in the field of distributed systems. Ken has been working recently on a new system called Isis2 (isis2.codeplex.com). I asked him a few questions on Isis2.

RVZ

Q1. Recently, you have been working on the Isis2 project. What is it?

Ken Birman: Isis2 started as a bit of a hobby for me, and the story actually dates back almost 25 years.
Early in my career, I created a system called the Isis Toolkit, using a model we called virtual synchrony to support strongly consistent replication for services running on clusters of various kinds. We had quite a success with that version of Isis, and it was the basis of the applications I mentioned above – I started a company and we did very well with it. In fact, for more than a decade there was never a disruptive failure at the New York Stock Exchange: components crashed now and then, obviously, but Isis would help guide the solution to a clean recovery and the traders were never aware of any issues. The same is true for the French ATC system or the US Navy AEGIS.

Now this model, virtual synchrony, has deep connections both to the ACID models used in database settings, and to the state machine replication model supported by protocols like Lamport’s Paxos. Indeed, recently we were able to show a bisimulation between the virtually synchronous protocol called gbcast (dating to around 1985) and a version of Paxos for dynamically reconfigurable systems. In some sense, gbcast was the first Paxos – Leslie Lamport says we should have named the protocol “virtually synchronous Paxos”.
(Of course if we had, I suspect that he would have named his own protocol something else!). I could certainly do the same with respect to database serializability – basically, virtual synchrony is like the ACID model, but aimed at “groups of processes” that might use purely in-memory replication. In effect my protocols were optimized ones aimed at supporting ACI but without the D.

Anyhow, over the years, the Isis Toolkit matured and people moved on. I moved on too, and worked on topics involving gossip communication, and security. But eventually this came to frustrate me: I felt that my own best work was being lost to disuse. And so starting four years ago, I decided to create a new and more modern Isis for the cloud. I’m calling it Isis2, and it uses the model Leslie and I talked about (virtually synchronous state machine replication). I’ve implemented the whole thing from scratch, using modern object-oriented languages and tools.

So Isis2 is a new open platform for data replication in the cloud, aiming at cases where a developer might be building some sort of service that would end up running on a cloud platform like Eucalyptus, Amazon EC2, etc. My original idea was to create the ultimate library for distributed computing, but also to make it as easy to use as the GUI builders that our students use in their first course on object oriented programming: powerful technologies but entirely accessed via very simple ideas, such as attaching an event handler to a suitable object. A first version of the Isis2 system became available about two years ago, and I’ve been working on it as much as possible since then, with new releases every couple of months.

But I’ve also slightly shifted my focus, in ways that are bringing Isis2 closer and closer to the big-data and ODBMS community.
I realized that unlike the situation 20 years ago, today the people who most need tools for replicating data with strong consistency are mostly working to create in-memory big-data platforms for the cloud, and then running large machine-learning algorithms on them.
For example, one important target for Isis2 is to support the smart grid here in the United States. So one imagines an application capturing all sorts of data in real-time, from what might be a vast number of sensing points, pulling this into the cloud, and then running a machine learning algorithm on the resulting in-memory data set. You replicate such a service for high availability and to gain faster read-only performance, and then run a distributed algorithm that learns the parameters to some model: perhaps a convex optimization, perhaps a support vector, etc.

I’ve run into other applications with a similar structure. For example, self-driving cars and other AUVs need to query a rapidly evolving situational database, asking questions such as “what road hazards should I be watching for the next 250 meters?” The community doing large-scale simulations for problems in particle physics needs to solve the “nearby particle” problem: “now that I’m at location such-and-such, what particles are close enough to interact with me?” One can make quite a long list. All of these demand rapid updates, in place, to a database that is living in-memory and being queried frequently.

With this in mind, I’ve been extending the Isis2 system to enlarge its support for key-value data, spread within a service and replicated, but with each item residing at just small subsets of the total set of members (“sharded”). My angle has been to offer strong consistency (not full transactions, because those involve locking and become expensive, but a still-powerful “one-shot atomic actions” model). Because Isis2 can offer these guarantees for a workload that includes a mix of updates and queries, the community working on graphical learning and other key-value learning problems where data might evolve even as the system is tracking various properties has shown strong interest in the platform.

Q2. You claim that Isis2 is a new option for cloud computing that can enable reliable, secure replication of data even in the highly elastic first-tier of the cloud. How does it different from existing commercial products such as NoSQL, Amazon Web Services etc.?

Ken Birman: Well, there are really a few differences. One is that Isis2 is a library. So if you compare with other technologies, the closest matches are probably GraphLab from CMU and Spark, Berkeley’s in-memory storage system for accelerating Hadoop computations. A second big difference is that Isis2 has a very strong consistency model, putting me at the other end of the spectrum relative to NoSQL. As for Web Services, well, the thinking is that the services you use Isis2 to create would often be web services, accessed over the web via some representative, which then turns around and interacts with other members using the Isis2 primitives.

Q3. How do you handle Big Data in Isis2?

Ken Birman: I’m adding more and more features, but there are two that should be especially interesting to your readers. One is the Isis2 DHT: a key-value store spread over a group of nodes, such that any given key maps to some small subset of the members (a shard). So keys might, for example, be node-ids for a graph, and the values could represent data associated with that node: perhaps a weight, in and out edges to other nodes, etc. The model is incredibly flexible: a key can be any object you like, and the values can also be arbitrary objects. You even can control the mapping of keys to group members, by implementing a specialized version of GetHashCode().

What I do is to allow atomic actions on sets of key-value pairs. So you can insert a collection of key-value pairs, or query a set of them, or even do an ordered multicast to just the nodes that host some set of keys. These actions occur on a consistent cut, giving a form of action by action atomicity. And of course, the solution is also fault-tolerant and even offers a strong form of security via AES-256 encryption, if desired.

What this enables is a powerful new form of distributed aggregation, in which one can guarantee that a query result reflects exactly-once contributions from each matching key-value pair. You can also insert new key-value tuples as part of one of these actions (although those occur as new atomic actions, ordered after the one that initiated them), or even reshuffle the mapping of keys to nodes, by “reconfiguring” the group to use a new GetHashCode() method.

You can also use LINQ in these groups, for example to efficiently query the “slice” of a key-value set that mapped to a particular set of nodes.

As I mentioned, I also have a second big-data feature that should become available later this summer: a tool for moving huge objects around within a cluster of cloud nodes. So suppose that an application is working with gigabyte memory-mapped files.
Sending these around inside messages could be very costly, and will cause the system to stutter because anything ordered after that send or multicast might have to wait for the gigabyte to transfer. With this new feature, I do out-of-band movement of the big objects in a very efficient way, using IP multicast if available (and a form of fat-tree TCP mesh if not), and then ship just a reference to the big object in my messages. This way the in-band communication won’t stutter, and the gigabyte object gets replicated using very efficient out-of-band protocols. By the end of the summer, I’m hoping to have the world’s fastest tool for replicating big memory-mapped objects in the cloud. (Of course, there can be quite a gap between “hoping” and “will have”, but I’m optimistic).

Q4. Isis2 uses a distributed sharding scheme. What is it?

Ken Birman: Again, there are a few answers – if Isis2 has a flaw, I suspect that it comes down to trying to offer every possible cool mechanism to every imaginable developer. Just like Microsoft .NET this can mean that you end up with too many stories. (You won’t be surprised to learn that Isis2 is actually built in .NET, using C#, and hence usable from any .NET language. We use Mono to compile it for use on Linux. And it does have a bit of that .NET flavor).

Anyhow, one scheme is the sharding approach mentioned above. The user takes some group – perhaps it has 10,000 members in it and runs on 10,000 virtual machines on Amazon EC2. And they ask Isis2 to shard the group into shards of some size, maybe 5 or 10 – you pick a factor aimed at soaking up the query load but not imposing too high an update cost.

So in that case I might end up with 1000 or 2000 subgroups: those would be my shards. The mapping is very explicit: given a key-value pair with key K, I compute K.GetHashCode() % NS, where NS is the number of shards derived from the target group size (call it N) and the target shard size (call it S): NS = N/S. And that tells me which shard will hold your key-value pair.
Given the group membership, which is consistent in Isis2, my protocols just count off from left to right to find the members that host this shard. The atomic update or query reaches out to those members: I involve all of them if the action is an update, and I load-balance over the shard for queries.
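The arithmetic is simple enough to sketch outside Isis2 itself. The Python lines below are purely illustrative (Python's hash() stands in for .NET's GetHashCode(), and the member names are made up); they only show how a key maps to a shard index and then to the slice of members hosting that shard.

def shard_members(key, members, shard_size):
    """Map a key to the subset of group members that host its shard.
    Illustrative only: Isis2 uses .NET GetHashCode(), not Python's hash()."""
    n = len(members)                 # target group size N
    num_shards = n // shard_size     # NS = N / S
    shard = hash(key) % num_shards   # K.GetHashCode() % NS
    start = shard * shard_size       # "count off from left to right"
    return members[start:start + shard_size]

members = [f"node-{i}" for i in range(10000)]   # e.g. 10,000 group members
print(shard_members("some-key", members, shard_size=5))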

The other mechanism is coarser-grained: one Isis2 application can actually support multiple process groups. And each of those groups can be a DHT, sharded separately. This is convenient if an application has one long-lived in-memory group for the database, but wants to create temporary data structures too. What you can do is to use a separate temporary group for those temporary key-value pairs or structures. Then when your computation is finished, you have an easy way to clean up: you just tell Isis2 to discard the temporary group and all its contents will evaporate, like magic.

Q5. Isis2 sounds like MapReduce. Is this right? What are the differences and what are the similarities?

Ken Birman: Roberto, you are exactly right about this. While you can treat the Isis2 DHT as a key-value store, a natural style of computing would be to use the kinds of code one sees with iterative MapReduce applications. Isis2 is in-memory, of course (although nothing stops you from pairing it with a persistent database or log – in fact I provide tools for doing that). But you could do this, and if you did, Isis2 would be acting a lot like Spark.

I think the main point is that whereas a system like Spark just speeds MapReduce up by caching partial results, and really works only for immutable or append-only data sets, Isis2 can support dynamic updates while the queries are running. My target is to be the MapReduce solution for the world’s realtime problems. Moreover, I’m doing this for people who need strong consistency.
Get the answer wrong in the power grid and you cause a power outage or damage a transformer (does anyone remember how Christchurch took itself off the New Zealand power grid for a month?). Get the answer wrong in a self-driving car, and it might have an accident. So for situations in which data is rapidly evolving, Isis2 offers a way to do highly scalable queries even as you support updates in a highly scalable way too.

I actually see the similarity to MapReduce as a good thing. To me this model has been incredibly popular, because people find it easy to use. I can leverage that popularity.

The bigger contrast is with true transactional databases. Isis2 does support locking, but not full-scale transactions. It would be dishonest to say that I don’t think transactions can run at massive scale – in fact I’m working with a post-doc, Ittay Eyal, on an extension of Isis2 that we call ACID-RAIN that has a full transactional model. And I know of others who are doing competing systems – Marcos Aguilera at Microsoft, for example, who is hard at work on his follow-on to the famous Sinfonia system.

But I don’t think we need the full costs of ACID transactions for in-memory key-value stores, and this has pushed me towards a lock-free computing model for this part of Isis2, and towards the kind of weaker atomicity I offer: guarantees for individual actions on sets of key-value pairs, but with each step in your computation treated as a separate one-shot transaction.

Q6. You claim to be able to run consistent queries even on rapidly changing data, and yet scales as well as any sharding scheme. Please explain how?

Ken Birman: The question centers on semantics: what do I mean by a consistent query? For me, a query that ran on a consistent cut (like with snapshot isolation) and gives a result that reflects exactly one contribution from each matching key-value pair, is a “consistent” query. This is definitely what you want for some purposes. But if you want to define consistency to mean for full transactions, I’m not taking that last step. I offer the building blocks – locking, atomic multicast, etc. You could easily run a transaction and do a 2-phase commit at the end. But I just doubt that this would perform well and so I haven’t picked it as my main consistency model.

Q7. Why using LINQ as a query language and not SQL?

Ken Birman: I’m using LINQ, but maybe not in the way you are thinking. There are some projects that do use LINQ as their main user API. In the Isis2 space, Naiad and the Dryad system that preceded it would be famous examples. But I find it hard to work with systems that “map” my LINQ query to a distributed key-value structure. I’ve always favored simpler building blocks.

So the way I’m using LINQ is more limited. For me, a user might issue some sort of query, and at the first step this looks like a multicast: you specify the target group or the target shards (depending on whether you want every member or just the ones associated with certain keys), and the information you offered as arguments to the query or multicast is automatically translated into an efficient outbound form and transmitted to the relevant members. On the members, an upcall event occurs to a handler the developer coded, perhaps in C# or perhaps in some other .NET language like C++/CLI, Visual Basic, Java, etc. I think .NET has something like 40 languages you can use – I myself work mostly in C#.

So this C# handler is invoked with the specified arguments, in parallel, on the set of group members that matched the target you specified. Each one of them has a slice of the full DHT: the subset of key-value pairs that mapped to that particular member, in the way we talked about earlier – a mapping you as the user can control, and even modify dynamically (if you do, Isis2 reshuffles the key-value pairs appropriately).

What I do with LINQ is to let your code access this slice of the DHT as a collection on which you could run a LINQ query. And in fact, as you may know, LINQ has an SQL front end, so you could even use SQL on these collections. So the C# handler has this potentially big set of key-value tuples, the full set for that slice (remember that I guarantee consistency!), and with LINQ each of those members now can compute its share of the answer.

What happens next is up to you. You could use this answer to insert new key-value tuples: that would be like the shuffle and reduce step in MapReduce. You could send the answers to the query initiator, as a reply – Isis2 supports 1-K queries that get lists of K results back, and the initiator could then aggregate the results (perhaps using LINQ at that step too, or SQL). And finally I have a tree-structured aggregation option, where I build a binary tree spanning the participants and combine the subresults using user-specified code, so that just one aggregated result comes out, after log(N) delay. That last option would be sensible if you might end up with a very large number of answers, or really big ones.
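The tree-structured option can be pictured in a few lines of Python. This is only a conceptual sketch of pairwise combination over roughly log2(N) rounds, not Isis2's actual aggregation API; the partial results and the combine function are arbitrary examples.

def tree_aggregate(partial_results, combine):
    """Combine per-member partial results pairwise, level by level,
    so the final answer emerges after about log2(N) rounds."""
    level = list(partial_results)
    rounds = 0
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
               for i in range(0, len(level), 2)]
        level, rounds = nxt, rounds + 1
    return level[0], rounds

# Example: each shard member contributes a partial sum of matching values.
partials = [3, 7, 2, 9, 4, 1, 5, 8]
total, rounds = tree_aggregate(partials, lambda a, b: a + b)
print(total, "after", rounds, "rounds")   # 39 after 3 rounds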

Q8. How fault-tolerant is Isis2?

Ken Birman: Virtual synchrony is very powerful as a tool for building pretty much any fault-tolerance approach one likes (people who prefer Paxos can reread that sentence as “the dynamic state machine model…” and people who think in terms of ACID transactions can see this as “ACID-based SQL systems…”). The model I favor is one in which updates are always applied in an all-or-nothing way, but queries might be partially completed and then abandoned (“aborted”) if a failure occurs.

So with Isis2, the basic guarantee is that an atomic multicast or query reaches all the operational group members that it is supposed to reach, and in a total order with respect to conflicting updates. Queries either reflect exactly once contributions from each matching key-value pair, or they abort (if a failure disrupts them), and you need to reissue the request in the new group membership – the new process group “view”.

Q9. What about Hadoop? Do you plan to have an interface to Hadoop?

Ken Birman: I’ve suggested this topic to a few of my PhD students but up to now, none has “bit”. I think my group tends to attract students who have more of a focus on lower levels of the infrastructure. But with our work on the smart power grid, this could change; I’m starting to interact much more with students who come from the machine learning community and who might find that step very appealing. It wouldn’t really be all that hard to do.

Q10. Isis2 is open source. How can developers contribute?

Ken Birman: This is a great question. The basic system is open source under a free 3-clause BSD license. You can access it here: isis2.codeplex.com. – I have a big user manual there, and one of those compiled HTML documentation pages for each of my APIs, and I’m going to be doing some form of instructional MOOC soon, so there should end up being ten or so short videos showing how to program with the system.

Initially, I think that people who would want to play with Isis2 might be best off limiting themselves to working with the system but not changing my code directly. My worry is that the code really is quite subtle and while I would love to have help, my experience here at Cornell has been that even well-intentioned students have made a lot of mistakes by not really understanding my code and then trying to change it. I’m sorry to say that as of now, not a line of third party code has survived – not because I don’t want help (I would love help), but because so far, all the third-party code has died horribly when I really tested it carefully!

But over time, I’m hoping that Isis2 could become more and more of a community tool. In fact complex or not, I do think others could definitely master it and help me debug and extend it. They just need to move no faster than the speed of light, which is kind of slow where large, complex tools are concerned. Building things that work “over” Isis2 but don’t change it is an especially good path: I’ll be happy to fix bugs other people identify, and then your add-ons can become third party extensions without being somehow key parts of the system. Then as things mature (keep in mind that Isis2 itself is nearly four years old right now), things could gradually migrate into the core release.

I think this is how other big projects tend to evolve towards an open development model: nobody trusts a contributor who shows up one day and announces that he wants to rewrite half the system. But if that person hangs around for a while and proves his or her talents over time, they end up in the inner circle. So I’m not trying to be overprotective, but I do want the system to be incredibly robust.

This is how we’re building the ACID-RAIN system I mentioned. Ittay Eyal owns the architecture of that system, but he has no interest at all in replicating things already available in Isis2, which after all is quite a big and powerful system. So he’s using it but rather than building ACID-RAIN by modifying Isis2, his system will be more of an application that uses Isis2. To the extent that he needs things Isis2 is lacking, I can build them for him. But later, if ACID-RAIN becomes the world’s ultimate SQL solution for the cloud, maybe Isis2 and ACID-RAIN would merge in some way. Over time I have no doubt at all that a talented developer like Ittay could become as expert as he needs to be even with my most obscure stuff.

And the fact is that I do need help. Tuning Isis2 to work really well in these massive settings is a hard challenge for me; more hands would definitely help. Right now I’m in the middle of porting it to work well with Infiniband connections, for example. You might think such a step would be trivial, and in a way it is: I just need to adapt the way that I talk to my network sockets. But in fact this has all sorts of implications for flow control, and timing in many protocols. A small step becomes a big task. (I’m kind of shuddering at the thought of needing to move to IPv6!) I can support the system now, with 3000 downloads to date but relatively few really active users. If someday I have lots of users and a StackOverflow.com tag of my very own, I’ll need a hand!

Qx Anything else you wish to add?

Ken Birman: Roberto, I just want to thank you for the great job you do with the ODBMS.org blog. I think it has become a great resource, and I hope this little interview gets a few of your readers interested in fooling around with Isis2! Tell them to feel free to contact me at Cornell, at ken@cs.cornell.edu.

—————–
Ken Birman is the N. Rama Rao Professor of Computer Science at Cornell. An ACM Fellow and the winner of the IEEE Tsutomu Kanai award, Ken has written 3 textbooks and published more than 150 papers in prestigious journals and conferences. Software he developed operated the New York Stock Exchange for more than a decade without trading disruptions, and played central roles in the French Air Traffic Control System (now expanding into much of Europe) and the US Navy AEGIS warship. Other technologies from his group found their way into IBM’s Websphere product, Amazon’s EC2 and S3 systems, and Microsoft’s cluster management solutions. His latest system, Isis2 (isis2.codeplex.com), helps developers create secure, strongly consistent and scalable cloud computing solutions.

Resources

Ken Birman’s Publication Listings.

ODBMS.org Cloud Data Stores – Lecture Notes: “Data Management in the Cloud” by Michael Grossniklaus, David Maier, Portland State University.
Lecture Notes | Intermediate/Advanced | English | DOWNLOAD ~280 slides (PDF)| 2011-12|

Related Posts

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com. November 2, 2011

Follow ODBMS.org on Twitter: @odbmsorg
##
