ODBMS Industry Watch » Internet of Things. Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation. http://www.odbms.org/blog

Democratizing the use of massive data sets. Interview with Dave Thomas.
Mon, 12 Sep 2016. http://www.odbms.org/blog/2016/09/democratizing-the-use-of-massive-data-sets-interview-with-dave-thomas/

“Any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.”–Dave Thomas.

I have interviewed Dave Thomas, Chief Scientist at Kx Labs.

RVZ

Q1. For many years business users have had their data locked up in databases and data warehouses. What is wrong with that?

Dave Thomas: It isn’t so much an issue of where the data resides, whether it is in files, databases, data warehouses or a modern data lake. The challenge is that modern businesses need access to the raw data, as well as the ability to rapidly aggregate and analyze their data.

Q2. Typical business intelligence (BI) tool users have never seen their actual data. Why?

Dave Thomas: For large corporations hardware and software both used to be prohibitively expensive, hence much of their data was aggregated prior to making it available to users. Even today when machines are very inexpensive most corporate IT infrastructures are impoverished relative to what one can buy on the street or in the Cloud.
Compounding the problem, IT charge-back mechanisms are biased to reduce IT spending rather than to maximize the value of data delivered to the business.
Traditional technologies are not sufficiently performant to allow processing of large volumes of data.
Many companies have inexpensive data lakes and have realized after the fact that using commodity storage systems, such as HDFS, has severely constrained their performance and limited their utility. Hence more corporations are moving data away from HDFS into high-performance storage or memory.

Q3. What are the limitations of the existing BI and extract, transform and load (ETL) data tools?

Dave Thomas: Traditional BI tools assume that it is possible for DBAs and BI experts to a priori define the best way to structure and query the data. This reduces the whole power of BI to mere reporting. In an attempt to deal with huge BI backlogs, generic query and reporting tools have become popular to shift reporting to self-serve. However, they are often designed for sophisticated BI users rather than for normal business users. They are often not performant because they depend on the implementation of the underlying data stores.
For the most part, existing ETL tools are constrained by having to move the data to the ETL process and then on to the end user. Many ETL tools only work against one kind of data source. ETL can’t be written by normal users and due to the cost of an incorrect ETL run, such tools are not available to the data analyst. One of the major topics of discussion in Big Data shops is the complexity and performance of their Big Data pipeline. ETL, data blending, shouldn’t be a separate process or product. It should be something one can do with queries in a single efficient data language.

Q4. What are the typical technical challenges in finance, IoT and other time-series applications?

Dave Thomas:
1. Speed, as data volumes and variety are always increasing.
2. Ability to deal with both real-time events and historical events efficiently. Ideally in a single technology.
3. To handle time-series one needs to be able to deal with simultaneous arrival of events. Time with nanosecond precision is our solution. Other solutions are constrained by using milliseconds and event counters that are much less efficient.
4. High-performance operations on time, over days, months and years are essential for time-series. This is why time is a native type in Kx.
5. The essence of time-series is processing sliding time windows of data for both joins and aggregations.
6. In IoT, data is always dirty. Kx’s native support for missing data and out of band data due to failing sensors allows one to deal with the realities of sensor data.
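To illustrate points 3, 5 and 6 above, here is a minimal Python sketch of a sliding time-window aggregation. It is not from the interview and does not show how Kx implements this; the function name `sliding_mean`, the use of `None` for a missing reading, and the sample readings are all assumptions made for the example. Timestamps are integer nanoseconds, so two events can share the same instant without losing precision.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Optional, Tuple

NS_PER_SECOND = 1_000_000_000


def sliding_mean(
    events: Iterable[Tuple[int, Optional[float]]],
    window_ns: int,
) -> Iterator[Tuple[int, Optional[float]]]:
    """For each event, yield (timestamp_ns, mean of the readings in the
    trailing window (ts - window_ns, ts]). Missing readings (None) are
    skipped; events are assumed to arrive in timestamp order."""
    window: Deque[Tuple[int, float]] = deque()
    for ts, value in events:
        if value is not None:          # a failed sensor contributes nothing
            window.append((ts, value))
        # Evict readings that have fallen out of the trailing window.
        while window and window[0][0] <= ts - window_ns:
            window.popleft()
        if window:
            yield ts, sum(v for _, v in window) / len(window)
        else:
            yield ts, None


# Hypothetical sensor feed: one missing reading, two simultaneous events.
feed = [
    (0 * NS_PER_SECOND, 10.0),
    (1 * NS_PER_SECOND, None),
    (2 * NS_PER_SECOND, 20.0),
    (2 * NS_PER_SECOND, 30.0),
    (5 * NS_PER_SECOND, 40.0),
]
results = list(sliding_mean(feed, window_ns=3 * NS_PER_SECOND))
for ts, avg in results:
    print(ts, avg)
```

The sketch recomputes the sum on every event for clarity; a production system would keep running aggregates instead.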

Q5. Kx offers analysts a language called q. Why not extend standard SQL?

Dave Thomas: I think there is a misunderstanding about q. Q is a full functional data language that both includes and extends SQL. Selects are easier than SQL because they provide implicit joins and group-bys. This makes queries roughly 50% of the code of SQL. Unlike many flavors of SQL, q lets one put a functional expression in any position in an SQL statement. One can easily extend the aggregation operations available to the end-user.

Q6. Can you show the difference between a query written in q and in standard SQL?

Dave Thomas: Here’s an example of retrieving parts from an orders table with a foreign key join to a parts table, summing quantity by color and then sorting by color:

q:
select sum qty by p.color from sp

SQL:
select p.color, sum(sp.qty) from sp, p
where sp.p=p.p group by p.color order by color
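
For readers who want to run the SQL version, the following Python sketch executes the same query against an in-memory SQLite database. The table layout and the sample rows are assumptions for illustration (the interview gives only the queries), and SQLite stands in for whatever SQL engine one would actually compare against q.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical parts (p) and orders (sp) tables; sp.p is a foreign key to p.p.
    CREATE TABLE p  (p TEXT PRIMARY KEY, color TEXT);
    CREATE TABLE sp (s TEXT, p TEXT REFERENCES p(p), qty INTEGER);
    INSERT INTO p  VALUES ('p1', 'red'), ('p2', 'green'), ('p3', 'blue');
    INSERT INTO sp VALUES ('s1', 'p1', 300), ('s1', 'p2', 200),
                          ('s2', 'p1', 300), ('s2', 'p3', 400);
    """
)

# The SQL statement from the interview, unchanged apart from layout.
rows = conn.execute(
    """
    select p.color, sum(sp.qty) from sp, p
    where sp.p = p.p group by p.color order by color
    """
).fetchall()

for color, total_qty in rows:
    print(color, total_qty)
```

The explicit join condition, `group by` and `order by` are the parts that the q form leaves implicit.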

Q7. How do queries execute inside the database?

Dave Thomas: Q is native to the database engine. Hence queries and analytics execute in the columns of the Kx database. There is no data shipping between the client and database server.

Q8. Shawn Rogers of Dell said: “A ‘citizen data scientist’ is an everyday, non-technical user that lacks the statistical and analytical prowess of a traditional data scientist, but is equally eager to leverage data in order to uncover insights, and importantly, do so at the speed of business.” What is your take on this?

Dave Thomas: High-performance data technologies, such as Kx, using modern large-memory hardware, can support data analysts versus data scientist queries. In the product Analyst for Kx, for example, users can work interactively on a sample of data using visual tools to import, clean, query, transform, analyze and visualize data with minimal, if any, programming or even SQL. Once the operations are correct on one or more samples, they can then be run against trillions of rows of data. Data analysts today can truly live in their data.

Q9. What are the risks of bringing the power of analytics to users who are non-expert programmers?

Dave Thomas: Clearly any important analysis needs to be validated and cross-checked. Hence any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.
In our experience users do make initial mistakes, but as they live in their data they quickly learn.
Visualization really helps, as does the provision of metadata about the data sources. Reducing the cycle time provides increased understanding, and allows one to make mistakes.
Runaway query performance has been a concern of DBAs, but for many years frameworks have been in place such as our smart query router that will ensure that ad hoc queries against massive datasets are throttled so they don’t run away. Fortunately, recent cost reductions in non-volatile memory make it possible to have high-performance query-only replicas of data that can be made available to different parts of the organization based on its needs.

Q10. How can non-expert programmers understand if the information expressed in visual analytics such as heat maps or in operational dashboard charts, is of good quality or not?

Dave Thomas: In our experience users spot visual anomalies much faster than inconsistencies in a spreadsheet.

Q11. What are the opportunities arising in “democratizing” the use of massive data sets?

Dave Thomas: We are finally living in a world where for many companies it is possible to run a real-time business where everyone can have fast, efficient access to the data they need. Rather than being held hostage to aggregations, spreadsheets and all sorts of variants of the truth, the organization can expediently see new opportunities to improve results in sales, marketing, production and other business operations.

Q12. How important is data query and data semantics?

Dave Thomas: Unfortunately we are not educated on how to express data semantics and data query.
Even computer scientists often study less about writing queries than how to execute them efficiently.
We need to educate students and employees on how to live in their data. It may well be that the future of programming for most will be writing queries. Given powerful data languages even compiler optimizations can be expressed by queries.
We need to invest much more in data governance and the use of standard terminology in order to share data within and across companies.

——————-
Dave Thomas, Kx Labs.
As Chief Scientist Dave envisions the future roadmap for Kx tools. Dave has had a long and storied career in computer software development and is perhaps best known as the founder and past CEO of Object Technology International, formerly OTI, now IBM OTI Labs, a pioneer in Agile Product Development. He was the principal visionary and architect for IBM VisualAge Smalltalk and Java tools and virtual machines including the popular open-source, multi-language Eclipse.org IDE. As the cofounder of Bedarra Research Labs he led the creation of the Ivy visual analytics workbench. Dave is a renowned speaker, university lecturer and Chairman of the Australian developer YOW! conferences.

Resources

New Kx release includes encryption, enhanced compression and Tableau integration. ODBMS.org JULY 4, 2016.

Resources for learning more about kdb+ and q benchmarking results.

Kdb+ and the Internet of Things/Big Data. InDetail Paper by Bloor Research Author: Philip Howard. ODBMS.org- JANUARY 28, 2015

Related Posts

Democratizing fast access to Big Data. By Dave Thomas, chief scientist at Kx Labs. ODBMS.org-April 26, 2016

On Data Governance. Interview with David Saul. ODBMS Industry Watch, Published on 2016-07-23

On the Challenges and Opportunities of IoT. Interview with Steve Graves. ODBMS Industry Watch, Published on 2016-07-06

On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, Published on 2016-05-24

Follow us on Twitter: @odbmsorg

##

On the Challenges and Opportunities of IoT. Interview with Steve Graves
Wed, 06 Jul 2016. http://www.odbms.org/blog/2016/07/on-the-challenges-and-opportunities-of-iot-interview-with-steve-graves/

“Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud.”–Steve Graves.

I have interviewed Steve Graves, co-founder and CEO of McObject. The main topic of the interview is the Internet of Things and how it relates to databases.

RVZ

Q1. What are in your opinion the main Challenges and Opportunities of the Internet of Things (IoT) seen from the perspective of a database vendor?

Steve Graves: Let’s start with the opportunities.

When we started McObject in 2001, we chose “eXtremeDB, the embedded database for intelligent, connected devices” as our tagline. eXtremeDB was designed from the get-go to live in the “things” comprising what the industry now calls the Internet of Things. The popularization of this term has created a lot of visibility and, more importantly, excitement and buzz for what was previously viewed as the relatively boring “embedded systems.” And that creates a lot of opportunities.

A lot of really smart, creative people are thinking of innovative ways to improve our health, our workplace, our environment, our infrastructure, and more. That means new opportunities for vendors of every component of the technology stack.
The challenges are manifold, and I can’t begin to address all of them. The media is largely fixated on security, which itself is multi-dimensional.
We can talk about protecting IoT-enabled devices (e.g. your car) from being hacked. We can talk about protecting the privacy of your data at rest. And we can talk about protecting the privacy of data in motion.
Every vendor needs to recognize the importance of security. But it isn’t enough for a vendor, like McObject, to provide the features to secure the target system; the developer that assembles the stack along with their own proprietary technology to create an IoT solution needs to use available security features, and use them correctly.

After security, scaling IoT systems is the next big challenge. It’s easy enough to prototype something.
But careful planning is needed to leap from prototype to full-blown deployment. Obvious decisions have to be made about connectivity and necessary bandwidth, how many things per gateway, one tier of gateways or more, and how much compute capacity is needed in the cloud. Beyond that, there are less obvious decisions to be made that will affect scalability, like making sure the DBMS used on devices and/or gateways is able to handle the workload (e.g. that the gateway DBMS can scale from 10 input streams to 100 input streams); determining how to divide the analytics workload between gateways and the cloud; and ensuring that the gateway, its DBMS and its communication stack can stream data to the cloud while simultaneously processing its own input streams and analytics.
Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud. In fact, ‘things’ are usually very much like good ol’ embedded systems, and system engineers that know their way around real-time/embedded operating systems, JTAG debuggers, and so on, have always been at a premium.

Q2. Data management for the IoT: What are the main differences between data management in field-deployed devices and at aggregation points?

Steve Graves: Quite simply: scale. A field-deployed device (or a gateway to field-deployed devices that do not, themselves, have any data management need or capability) has to manage a modest amount of data. But an aggregation point (the cloud being the most obvious example) has to manage many times more data – possibly orders of magnitude more.
At the same time, I have to say that they might not be all that different. Some IoT systems are going to be closed, meaning the nature of the things making up the system is known, and these won’t require much scaling. For example, a building automation system for a small- to mid-size building would have perhaps 100s of sensors and 10s of gateways, and may (or may not) push data up to a central aggregation point. If there are just 10s of gateways, we can create a UI that connects to the database on each gateway where each database is one shard of a single logical database, and execute analytics against that logical database without any need of a central aggregation point. We can extend this hypothetical case to a campus of buildings, or to a landlord with many buildings in a metropolitan area, and then a central aggregation point makes sense.

But the database system would not necessarily be different, only the organization of the physical and logical databases.
The gateways of each building would stream to a database server in the cloud. In the case of 10 buildings, we could have 10 database servers in the cloud that represent 10 shards of that logical database in the cloud. This architecture allows for great scalability. The landlord acquires another building? Great, stand up another database server and the UI connects to 11 shards instead of 10. In this scenario, database servers are software, not hardware. For the numbers we’re talking about (10 or 11 buildings), it could easily be handled by a single hardware server of modest ability.
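
The sharded arrangement described above can be sketched in Python as follows. This is an illustration under assumptions, not eXtremeDB's API: each shard is modelled as a separate in-memory SQLite database, and the names `make_shard`, `average_temp` and the `reading` table are hypothetical. The point it shows is that the client fans one query out to every shard and merges partial results (here a sum and a count), so adding a building only means adding one more connection.

```python
import sqlite3
from typing import Iterable, List, Optional, Tuple


def make_shard(readings: Iterable[Tuple[str, float]]) -> sqlite3.Connection:
    """Create one shard (one building's database) holding sensor readings."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reading (sensor TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO reading VALUES (?, ?)", readings)
    return conn


def average_temp(shards: List[sqlite3.Connection]) -> Optional[float]:
    """Run the same query on every shard and merge the partial aggregates.

    Sums and counts are merged rather than per-shard averages, because an
    average of averages is wrong when shards hold different row counts.
    """
    total, count = 0.0, 0
    for conn in shards:
        shard_sum, shard_count = conn.execute(
            "SELECT COALESCE(SUM(temp_c), 0), COUNT(temp_c) FROM reading"
        ).fetchone()
        total += shard_sum
        count += shard_count
    return total / count if count else None


# Two buildings to start with (sample data is made up).
shards = [
    make_shard([("a1", 20.0), ("a2", 22.0)]),
    make_shard([("b1", 30.0)]),
]
before = average_temp(shards)

# The landlord acquires another building: stand up one more shard.
shards.append(make_shard([("c1", 24.0)]))
after = average_temp(shards)
print(before, after)
```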

At the other end of the scale (pun intended) are IoT systems that are wide open. By that, I mean the creators are not able to anticipate the universe of “things” that could be connected, or their quantity. In the first case, the database system should be able to ingest data that was heretofore unknown. This argues for a NoSQL database system, i.e. a database system that is schema-less. In this scenario, the database system on field-deployed devices is probably radically different from the database system in the cloud. Field-deployed devices are purpose-specific, so A) they don’t need and wouldn’t benefit from a NoSQL database system, and B) most NoSQL database systems are too resource-hungry to reside on embedded device nodes.

Q3. If we look at the characteristics of a database system for managing device-based data in the IoT, how do they differ from the characteristics of a database system (typically deployed on a server) for analyzing the “big data” generated by myriad devices?

Steve Graves: Again, let’s recognize that field-deployed devices in the IoT are classic embedded systems. In practical terms, that means relatively modest hardware like an ARM, MIPS, PowerPC or Atom processor running at 100s of megahertz, or perhaps 1 GHz if we’re lucky, and with only enough memory to perform its function. Further, it may require a real-time operating system, or at least an embedded operating system that is less resource hungry than a full-on Linux distro. So, for a database system to run in this environment, it will need to have been designed to run in this environment. It isn’t practical to try to shoehorn in a database system that was written on the assumption that CPU cycles and memory are abundant. It may also be the case that the device has little-to-no persistent storage, which mandates an in-memory database.

So a database system for a field-deployed device is going to
1. have a small code size
2. use little stack
3. preferably, allocate no heap memory
4. have no, or minimal, external dependencies (e.g. not link in an extra 1 MB of code from the C run-time library)
5. have built-in ability to replicate data (to a gateway or directly to the cloud)
a. Replication should be “open”, meaning be able to replicate to a different database system
6. have built-in security features

7. Nice to have:
a. built-in analytics to aggregate data prior to replicating it
b. ability to define the schema
c. ability to operate entirely in memory
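
As a small illustration of item 7a (aggregating data before replicating it), the Python sketch below collapses one reporting interval of raw readings into a single summary record that a device could send upstream instead of every sample. The function name `summarize` and the choice of summary fields are assumptions for the example, not a description of any particular product.

```python
from statistics import mean
from typing import Dict, List


def summarize(readings: List[float]) -> Dict[str, float]:
    """Collapse one interval of raw readings into a small summary record."""
    if not readings:
        raise ValueError("no readings in this interval")
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": mean(readings),
    }


# Three raw samples become one record to replicate.
summary = summarize([20.0, 22.0, 24.0])
print(summary)
```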

A database system for the cloud might benefit from being schema-less, as described previously. It should certainly have pretty elastic scalability. Servers in the cloud are going to have ample resources and robust operating systems. So a database system for the cloud doesn’t need to have a small code size, use a small amount of stack memory, or worry about external dependencies such as the C run-time library. On the contrary, a database system for the cloud is expected to do much more (handle data at scale, execute analytics, etc.) and will, therefore, need ample resources. In fact, this database system should be able to take maximum advantage of the resources available, including being able to scale horizontally (across cores, CPUs, and servers).
In summary, the edge (device-based) DBMS needs to operate in a constrained environment. A cloud DBMS needs to be able to effectively and efficiently utilize the ample resources available to it.

Q4. Why is the ability to define a database schema important (versus a schema-less DBMS, aka NoSQL) for field-deployed devices?

Steve Graves: Field-deployed devices will normally perform a few specific functions (sometimes, just one function). For example, a building automation system manages HVAC, lighting, etc. A livestock management system manages feed, output, and so on. In such systems, the data requirements are well known. The hallmark NoSQL advantage of being able to store data without predefining its structure is unwarranted. The other purported hallmark of NoSQL is horizontal scalability, but this is not a need for field-deployed devices.
Walking away from the relational database model (and its implicit use of a database schema) has serious implications.
A great deal of scientific knowledge has been amassed around the relational database model over the last few decades, and without it developers are completely on their own with respect to enforcing sound data management practices.

In the NoSQL sphere, there is nothing comparable to the relational model (e.g. E.F. Codd’s work) and the mathematical foundation (relational calculus) underpinning it.
There should be overwhelming justification for a decision to not use relational.
In my experience, that justification is absent for data management of field-deployed devices.
A database system that “knows” the data design (via a schema) can more intelligently manage the data. For example, it can manage constraints, domain dependencies, events and much more. And some of the purported inflexibility imposed by a schema can be eliminated if the DBMS supports dynamic DDL (see more details on this in the answer to question Q6, below).
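
A minimal Python sketch of what "managing constraints" via a schema can look like, using SQLite as a stand-in engine. The table name, column names and the temperature range are hypothetical. Because the schema declares the valid domain, the database itself rejects an out-of-range reading and the application code does not have to re-check it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE hvac_reading (
        zone   TEXT NOT NULL,
        temp_c REAL NOT NULL CHECK (temp_c BETWEEN -40 AND 85)
    )
    """
)

# A plausible reading is accepted.
conn.execute("INSERT INTO hvac_reading VALUES ('lobby', 21.5)")

# A reading outside the declared domain is rejected by the DBMS.
try:
    conn.execute("INSERT INTO hvac_reading VALUES ('lobby', 900.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT COUNT(*) FROM hvac_reading").fetchone()[0]
print(rejected, stored)
```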

Q5. In your opinion, do IoT aggregation points resemble data lakes?

Steve Graves: The term data lake was originally conceived in the context of Hadoop and map-reduce functionality. In more recent times, the meaning of the term has morphed to become synonymous with big data, and that is how I use the term. Insofar as a gateway can also be an aggregation point, I would not say ‘aggregation points resemble data lakes’ because gateway aggregation points, in all likelihood, will not manage Big Data.

Q6. What are the main technical challenges for database systems used to accommodate new and unforeseen data, for example when a new type of device begins streaming data?

Steve Graves: The obvious challenges are
1. The ability to ingest new data that has a previously unknown structure
2. The ability to execute analytics on #1
3. The ability to integrate analytics on #1 with analytics on previously known data

#1 is handled well by NoSQL DBMSs. But, it might also be handled well by an RDBMS via “dynamic DDL” (dynamic data definition language), e.g. the ability to execute CREATE TABLE, ALTER TABLE, and/or CREATE INDEX statements against an existing database.
To efficiently execute analytics against any data, the structure of the data must eventually be understood.
RDBMSs handle this through the database dictionary (the binary equivalent of the data definition language).
But some NoSQL DBMSs handle this through different metadata. For example, the MarkLogic DBMS uses JSON metadata to understand the structure of documents in its document store.
NoSQL DBMSs with no metadata whatsoever put the entire burden on the developers. In other words, since the data is opaque to the DBMS, the application code must read and interpret the content.
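
The dynamic DDL approach mentioned in this answer can be sketched in Python with SQLite as a stand-in RDBMS. The device types, table names and columns are hypothetical. The sketch runs CREATE TABLE, CREATE INDEX and ALTER TABLE against a database that already holds data, then joins the new data with the previously known data, which corresponds to challenges #1 to #3 above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The database as originally deployed: one known device type.
conn.execute("CREATE TABLE thermostat (device_id TEXT, ts INTEGER, temp_c REAL)")
conn.execute("INSERT INTO thermostat VALUES ('t1', 1, 21.0)")

# A new type of device starts streaming: add a table and an index at run time.
conn.execute("CREATE TABLE air_quality (device_id TEXT, ts INTEGER, co2_ppm REAL)")
conn.execute("CREATE INDEX air_quality_ts ON air_quality (ts)")
conn.execute("INSERT INTO air_quality VALUES ('a1', 1, 415.0)")

# An existing device type gains a new field: extend its table in place.
conn.execute("ALTER TABLE thermostat ADD COLUMN humidity_pct REAL")
conn.execute("INSERT INTO thermostat VALUES ('t1', 2, 21.4, 40.0)")

# Analytics that combine the new data with the previously known data.
rows = conn.execute(
    """
    SELECT t.ts, t.temp_c, t.humidity_pct, a.co2_ppm
    FROM thermostat AS t
    LEFT JOIN air_quality AS a ON a.ts = t.ts
    ORDER BY t.ts
    """
).fetchall()

for row in rows:
    print(row)
```

Rows written before the ALTER TABLE simply report the new column as missing (NULL), so old and new data remain queryable together.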

Q7. Client/server DBMS architecture vs. in-process DBMSs: which one is more suitable for IoT?

Steve Graves: For edge DBMSs (on constrained devices), an in-process architecture will be more suitable. It requires fewer resources than client/server architecture, and imposes less latency through elimination of inter-process communication. For cloud DBMSs, a client/server architecture will be more suitable. In the cloud environment, resources are not scarce, and the advantage of being able to scale horizontally will outweigh the added latency associated with client/server.

Qx. Anything else you wish to add?

Steve Graves: We feel that eXtremeDB is uniquely positioned for the Internet of Things. Not only have devices and gateways been in eXtremeDB’s wheelhouse for 15 years with over 25 million real world deployments, but the scalability, time series data management, and analytics built into the eXtremeDB server (big data) offering make it an attractive cloud database solution as well. Being able to leverage a single DBMS across devices, gateways and the cloud has obvious synergistic advantages.

———————
Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.

Resources

Big Data, Analytics, and the Internet of Things, by Mohak Shah, analytics leader and research scientist at Bosch Research, USA. ODBMS.org APRIL 6, 2015

Privacy considerations & responsibilities in the era of Big Data & Internet of Things, by Ramkumar Ravichandran, Director, Analytics, Visa Inc. ODBMS.org JANUARY 8, 2015.

Securing Your Largest USB-Connected Device: Your Car, by Shomit Ghose, General Partner, ONSET Ventures. ODBMS.org MARCH 31, 2016.

eXtremeDB Financial Edition DBMS Sweeps Records in Big Data Benchmark. ODBMS.org JULY 2, 2016

eXtremeDB in-memory database

User Experience Design for the Internet of Things

Related Posts

On the Internet of Things. Interview with Colin Mahony. ODBMS Industry Watch, Published on 2016-03-14

A Grand Tour of Big Data. Interview with Alan Morrison. ODBMS Industry Watch, Published on 2016-02-25

On the Industrial Internet of Things. Interview with Leon Guzenda. ODBMS Industry Watch, January 28, 2016

Follow us on Twitter: @odbmsorg

##

On Data Interoperability. Interview with Julie Lockner.
Tue, 07 Jun 2016. http://www.odbms.org/blog/2016/06/on-data-interoperability-interview-with-julie-lockner/

“From a healthcare perspective, how can we aggregate all the medical data, in all forms from multiple sources, such as wearables, home medical devices, MRI images, pharmacies and so on, and also blend in intelligence or new data sources, such as genomic data, so that doctors can make better decisions at the point of care?”– Julie Lockner.

I have interviewed Julie Lockner. Julie leads data platform product marketing for InterSystems. The main topics of the interview are Data Interoperability and InterSystems’ data platform strategy.

RVZ

Q1. Everybody is talking about Big Data — is the term obsolete?

Julie Lockner: Well, there is no doubt that the sheer volume of data is exploding, especially with the proliferation of smart devices and the Internet of Things (IoT). An overlooked aspect of IoT is the enormous volume of data generated by a variety of devices, and how to connect, integrate and manage it all.

The real challenge, though, is not just processing all that data, but extracting useful insights from the variety of device types. Put another way, not all data is created using a common standard. You want to know how to interpret data from each device, know which data from what type of device is important, and which trends are noteworthy. Better information can create better results when it can be aggregated and analyzed consistently, and that’s what we really care about. Better, higher quality outcomes, not bigger data.

Q2. If not Big Data, where do we go from here?

Julie Lockner: We always want to be focusing on helping our customers build smarter applications to solve real business challenges, such as helping them to better compete on service, roll out high-quality products quicker, simplify processes – not build solutions in search of a problem. A canonical example is in retail. Our customers want to leverage insight from every transaction they process to create a better buying experience online or at the point of sale. This means being able to aggregate information about a customer, analyze what the customer is doing while on the website, and make an offer at transaction time that would delight them. That’s the goal – a better experience – because that is what online consumers expect.

From a healthcare perspective, how can we aggregate all the medical data, in all forms from multiple sources, such as wearables, home medical devices, MRI images, pharmacies and so on, and also blend in intelligence or new data sources, such as genomic data, so that doctors can make better decisions at the point of care? That implies we are analyzing not just more data, but better data that comes in all shapes and sizes, and that changes more frequently. It really points to the need for data interoperability.

Q3. What are the challenges software developers are telling you they have in today’s data-intensive world?

Julie Lockner: That they have too many database technologies to choose from and prefer to have a simple data platform architecture that can support multiple data models and multiple workloads within a single development environment.
We understand that our customers need to build applications that can handle a vast increase in data volume, but also a vast array of data types – static, non-static, local, remote, structured and non-structured. It must be a platform that coalesces all these things, brings services to data, offers a range of data models, and deals with data at any volume to create a more stable, long-term foundation. They want all of these capabilities in one platform – not a platform for each data type.

For software developers today, it’s not enough to pick elements that solve some aspect of a problem and build enterprise solutions around them; not all components scale equally. You need a common platform without sacrificing scalability, security, resilience, rapid response. Meeting all these demands with the right data platform will create a successful application.
And the development experience is significantly improved and productivity drastically increased when they can use a single platform that meets all these needs. This is why they work with InterSystems.

Q4. Traditionally, analytics is used with structured data, “slicing and dicing” numbers. But the traditional approach also involves creating and maintaining a data warehouse, which can only provide a historical view of data. Does this work also in the new world of Internet of Things?

Julie Lockner: I don’t think so. It is generally possible to take amorphous data and build it into a structured data model, but to respond effectively to rapidly changing events, you need to be able to take data in the form in which it comes to you.

If your data platform lacks certain fields, if you lack schema definition, you need to be able to capitalize on all these forms without generating a static model or a refinement process. With a data warehouse approach, it can take days or weeks to create fully cleansed, normalized data.
That’s just not fast enough in today’s always-on world – especially as machine-generated data is not conforming to a common format any time soon. It comes back to the need for a data platform that supports interoperability.

Q5. How hard is it to make decisions based on real-time analysis of structured and unstructured data?

Julie Lockner: It doesn’t have to be hard. You need to generate rules that feed rules engines that, in turn, drive decisions, and then constantly update those rules. That is a radical enhancement of the concept of analytics in the service of improving outcomes, as more real-time feedback loops become available.

The collection of changes we describe as Big Data will profoundly transform enterprise applications of the future. Today we can see the potential to drive business in new ways and take advantage of a convergence of trends, but it is not happening yet. Where progress has been made is the intelligence of devices and first-level data aggregation, but not in the area of services that are needed. We’re not there yet.

Q6. What’s next on the horizon for InterSystems in meeting the data platform requirements of this new world?

Julie Lockner: We continually work on our data platform, developing the most innovative ways we can think of to integrate with new technologies and new modes of thinking. Interoperability is a hugely important component. It may seem a simple task to get to the single most pertinent fact, but the means to get there may be quite complex. You need to be able to make the right data available – easily – to construct the right questions.

Data is in all forms and at varying levels of completeness, cleanliness, and accuracy. For data to be consumed as we describe, you need measures of how well you can use it. You need to curate data so it gets cleansed and you can cull what is important. You need flexibility in how you view data, too. Gathering data without imposing an orthodoxy or structure allows you to gain access to more data. Not all data will conform to a schema a priori.
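One simple way to picture a "measure of how well you can use" a record is a completeness score. The sketch below is an assumption made for illustration: the field names, the `completeness` and `curate` helpers, and the 0.5 cut-off are all hypothetical.

```python
from typing import Any, Dict, Iterable, List


def completeness(record: Dict[str, Any], expected: Iterable[str]) -> float:
    """Fraction of expected fields that are present and non-empty."""
    fields = list(expected)
    if not fields:
        return 1.0
    present = sum(1 for f in fields if record.get(f) not in (None, ""))
    return present / len(fields)


def curate(records: List[Dict[str, Any]], expected: Iterable[str],
           min_score: float) -> List[Dict[str, Any]]:
    """Keep only records that meet the minimum completeness score."""
    fields = list(expected)
    return [r for r in records if completeness(r, fields) >= min_score]


# Hypothetical expected fields; records arrive without a fixed schema.
EXPECTED = ["patient_id", "timestamp", "heart_rate", "device"]
records = [
    {"patient_id": "p1", "timestamp": "2016-06-01T10:00:00",
     "heart_rate": 72, "device": "d9"},
    {"patient_id": "p2", "heart_rate": 0},
    {"timestamp": "", "note": "free text"},
]

scores = [completeness(r, EXPECTED) for r in records]
kept = curate(records, EXPECTED, min_score=0.5)
```

Note that nothing is rejected at ingestion time: every record is accepted as it arrives, and the score is applied afterwards to decide what is worth keeping.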

Q7. Recently you conducted a benchmark test of an application based on InterSystems Caché®. Could you please summarize the main results you have obtained?

Julie Lockner: One of our largest customers is Epic Systems, one of the world’s top healthcare software companies.
Epic relies on Caché as the data platform for electronic medical record solutions serving more than half the U.S. patient population and millions of patients worldwide.

Epic tested the scalability and performance improvements of Caché version 2015.1. Almost doubling the scalability of prior versions, Caché delivers what Epic President Carl Dvorak has described as “a key strategic advantage for our user organizations that are pursuing large-scale medical informatics programs as well as aggressive growth strategies in preparation for the volume-to-value transformation in healthcare.”

Qx Anything else you wish to add?

Julie Lockner: The reason why InterSystems has succeeded in the market for so many years is a commitment to the success of those who depend on our technology. A recent Gartner Magic Quadrant report found we had the highest number of customers surveyed – 85% – who would buy from us again. That is the highest number of any vendor participating in that study.

The foundation of the company’s culture is all about helping our customers succeed. When our customers come to us with a challenge, we all pitch in to solve it. Many times our solutions may address an unusual problem that could benefit others – which then becomes the source of many of our innovations. It is one of the ways we are using problem-solving skills as a winning strategy to benefit others. When our customers are successful at using our engine to solve the world’s most important challenges, we all win.

——————-

Julie Lockner leads data platform product marketing for InterSystems. She has more than 20 years of experience in IT product marketing management and technology strategy, including roles at analyst firm ESG as well as Informatica and EMC.

—————–

Resources

“InterSystems Unveils Major New Release of Caché,” Feb. 25, 2015.

“Gartner Magic Quadrant for Operational DBMS,” Donald Feinberg, Merv Adrian, Nick Heudecker, Adam M. Ronthal, and Terilyn Palanca, October 12, 2015, ID: G00271405.

– White Paper: Big Data Healthcare: Data Scalability with InterSystems Caché® and Intel® Processors (LINK to .PDF)

Related Posts

– A Grand Tour of Big Data. Interview with Alan Morrison. ODBMS Industry Watch, February 25, 2016

–  RIP Big Data. By Carl Olofson, Research Vice President, Data Management Software Research, IDC. ODBMS.org, JANUARY 6, 2016.

What is data blending. By Oleg Roderick, David Sanchez, Geisinger Data Science. ODBMS.org, November 2015

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2016/06/on-data-interoperability-interview-with-julie-lockner/feed/ 0
On Data Analytics and the Enterprise. Interview with Narendra Mulani. http://www.odbms.org/blog/2016/05/on-data-analytics-and-the-enterprise-interview-with-narendra-mulani/ http://www.odbms.org/blog/2016/05/on-data-analytics-and-the-enterprise-interview-with-narendra-mulani/#comments Tue, 24 May 2016 16:31:20 +0000 http://www.odbms.org/blog/?p=4144

“A hybrid technology infrastructure that combines existing analytics architecture with new big data technologies can help companies to achieve superior outcomes.”–Narendra Mulani

I have interviewed Narendra Mulani, Chief Analytics Officer, Accenture Analytics. Main topics of our interview are: Data Analytics, Big Data, the Internet of Things, and their repercussions for the enterprise.

RVZ

Q1. What is your role at Accenture?

Narendra Mulani: I’m the Chief Analytics Officer at Accenture Analytics and I am responsible for building and inspiring a culture of analytics and driving Accenture’s strategic agenda for growth across the business. I lead a team of analytics professionals around the globe that are dedicated to helping clients transform into insight-driven enterprises and focused on creating value through innovative solutions that combine industry and functional knowledge with analytics and technology.

With the constantly increasing amount of data and new technologies becoming available, it truly is an exciting time for Accenture and our clients alike. I’m thrilled to be collaborating with my team and clients and taking part, first-hand, in the power of analytics and the positive disruption it is creating for businesses around the globe.

Q2. What are the main drivers you see in the market for Big Data Analytics?

Narendra Mulani: Companies across industries are fighting to secure or keep their lead in the marketplace.
To excel in this competitive environment, they are looking to exploit one of their growing assets: Data.
Organizations see big data as a catalyst for their transformation into digital enterprises and as a way to secure an insight-driven competitive advantage. In particular, big data technologies are giving companies greater agility, as they help them to analyze data comprehensively and take more informed actions at a swifter pace. We’ve already passed the transition point with big data – instead of discussing the possibilities with big data, many are already experiencing the actual insight-driven benefits from it, including increased revenues, a larger base of loyal customers, and more efficient operations. In fact, we see our clients looking for granular solutions that leverage big data, advanced analytics and the cloud to address industry-specific problems.

Q3. Analytics and Mobility: how do they correlate?

Narendra Mulani: Analytics and mobility are two digital areas that work hand-in-hand on many levels.
As an example, mobile devices and the increasingly connected world through the Internet of Things (IoT) have become two key drivers for big data analytics. As mobile devices, sensors, and the IoT are constantly creating new data sources and data types, big data analytics is being applied to transform the increasing amount of data into important and actionable insight that can create new business opportunities and outcomes. Also, this view can be reversed, where analytics feeds insight into mobile devices such as tablets to workers in offices or out in the field to enable them to make real-time decisions that could benefit their business.

Q4. Data explosion: What does it create? Risks, Value or both?

Narendra Mulani: The data explosion that’s happening today and will continue to happen due to the Internet of Things creates a lot of opportunity for businesses. While organizations recognize the value that the data can generate, the sheer amount of data – internal data, external data, big data, small data, etc – can be overwhelming and create an obstacle for analytics adoption, project completion, and innovation. To overcome this challenge and pursue actionable insights and outcomes, organizations shouldn’t look to analyze all of the data that’s available, but identify the right data needed to solve the current project or challenge at hand to create value.

It’s also important for companies to manage the potential risk associated with the influx of data and take the steps needed to optimize and protect it. They can do this by aligning IT and business leads to jointly develop and maintain data governance and security strategies. At a high level, the strategies would govern who uses the data and how the data is analyzed and leveraged, define the technologies that would manage and analyze the data, and ensure the data is secured with the necessary standards. Suitable governance and security strategies should be requirements for insight-driven businesses. Without them, organizations could experience adverse and counter-productive results.

Q5. You introduced the concept of the “Modern Data Supply Chain”. How does it differ from the traditional Supply Chain?

Narendra Mulani: As companies’ data ecosystems are usually very complex with many data silos, a modern data supply chain helps them to simplify their data environment and generate the most value from their data. In brief, when data is treated as a supply chain, it can flow swiftly, easily and usefully through the entire organization, and also through its ecosystem of partners, including customers and suppliers.

To establish an effective modern data supply chain, companies should create a hybrid technology environment that enables a data service platform with emerging big data technologies. As a result, businesses will be able to access, manage, move, mobilize and interact with broader and deeper data sets across the organization at a much quicker pace than previously possible. They can then act on the resulting analytics insights, which could help them to deliver more effectively to their consumers, develop innovative new solutions, and differentiate themselves in their markets.

Q6. You talked about “Retooling the Enterprise”. What do you mean by this?

Narendra Mulani: Some businesses today are no longer just using analytics, they are taking the next step by transforming into insight-driven enterprises. To achieve “insight-driven enterprise” status, organizations need to retool themselves for optimization. They can pursue an insight-driven transformation by:

· Establishing a center of gravity for analytics – a center of gravity for analytics often takes the shape of a Center of Excellence or a similar concentration of talent and resources.
· Employing agile governance – build horizontal governance structures that are focused on outcomes and speed to value, and take a “test and learn” approach to rolling out new capabilities. A secure governance foundation could also improve the democratization of data throughout a business.
· Creating an inter-disciplinary high performing analytics team – field teams with diverse skills, organize talent effectively, and create innovative programs to keep the best talent engaged.
· Deploying new capabilities faster – deploy new, modern and agile technologies, as well as hybrid architectures and specifically designed toolsets, to help revolutionize how data has been traditionally managed, curated and consumed, to achieve speed to capability and desired outcomes. When appropriate, cloud technologies should be integrated into the IT mix to benefit from cloud-based usage models.
· Raising the company’s analytics IQ – have a vision of what would be your “intelligent enterprise” and implement an Analytics Academy that provides analytics training for functional business resources in addition to the core management training programs.

Q7. What are the risks from the Internet of Things? And how is it possible to handle such risks?

Narendra Mulani: The IoT is prompting an even greater focus on data security and privacy. As a company’s machines, employees and ecosystems of partners, providers, and customers become connected through the IoT, securing the data that is flowing across the IoT grid can be increasingly complex. Today’s sophisticated cyber attackers are also amplifying this complexity as they are constantly evolving and leveraging data technology to challenge a company’s security efforts.

To establish a strong, effective real-time cyber defense strategy, security teams will need to employ innovative technologies to identify threat behavioral patterns – including artificial intelligence, automation, visualisation, and big data analytics – and an agile and fluid workforce to leverage the opportunities presented by technology innovations. They should also establish policies to address privacy issues that arise out of all the personal data that are being collected. Through this combination of efforts, companies will be able to strengthen their approach to cyber defense in today’s highly connected IoT world and empower cyber defenders to help their companies better anticipate and respond to cyber attacks.

Q8. What are the main lessons you have learned in implementing Big Data Analytic projects?

Narendra Mulani: Organizations should explore the entire big data technology ecosystem, take an outcome-focused approach to addressing specific business problems, and establish precise success metrics before an analytics project even begins. The big data landscape is in a constant state of change with new data sources and emerging big data technologies appearing every day that could offer a company a new value-generating opportunity. A hybrid technology infrastructure that combines existing analytics architecture with new big data technologies can help companies to achieve superior outcomes.
An outcome-focused strategy is very valuable when it embraces analytics experimentation, explores the data and technology that can help a company meet its goals, and includes checkpoints for measuring performance. Such a strategy helps the analytics team to know whether they should continue on course or make a course correction to attain the desired outcome.

Q9. Is Data Analytics only good for businesses? What about using (Big) Data for Societal issues?

Narendra Mulani: Analytics is helping businesses across industries and governments as well to make more informed decisions for effective outcomes, whether it might be to improve customer experience, healthcare or public safety.
As an example, we’re working with a utility company in the UK to help them leverage analytics insights to anticipate equipment failures and respond in near real-time to critical situations, such as leaks or adverse weather events. We are also working with a government agency to analyze its video monitoring feeds to identify potential public safety risks.

Qx Anything else you wish to add?

Narendra Mulani: Another area that’s on the rise is Artificial Intelligence – we define it as a collection of multiple technologies that enable machines to sense, comprehend, act and learn, either on their own or to augment human activities. The new technologies include machine learning, deep learning, natural language processing, video analytics and more. AI is disrupting how businesses operate and compete and we believe it will also fundamentally transform and improve how we work and live. When an organization is pursuing an AI project, it’s our belief that it should be business-oriented, people-focused, and technology rich for it to be most effective.

———

As Chief Analytics Officer and Head Geek – Accenture Analytics, Narendra Mulani is responsible for creating a culture of analytics and driving Accenture’s strategic agenda for growth across the business. He leads a dedicated team of 17,000 Analytic professionals that serve clients around the globe, focusing on value creation through innovative solutions that combine industry and functional knowledge with analytics and technology.

Narendra has held a number of leadership roles within Accenture since joining in 1997. Most recently, he was the managing director – Products North America, where he was responsible for creating value for our clients across a number of industries. Prior to that, he was managing director – Supply Chain, Accenture Management Consulting, leading a global practice responsible for defining and implementing supply chain capabilities at a diverse set of Fortune 500 clients.

Narendra graduated from Bombay University in 1978 with a Bachelor of Commerce, and received an MBA in Finance in 1982 as well as a PhD in 1985 focused on Multivariate Statistics, both from the University of Massachusetts.

Outside of work, Narendra is involved with various activities that support education and the arts. He lives in Connecticut with his wife Nita and two children, Ravi and Nikhil.

———-

Resources

– Ducati is Analytics Driven. Analytics takes Ducati around the world at speed and precision.

Accenture Analytics. Launching an insights-driven transformation.  Download the point of view on analytics operating models to better understand how high performing companies are organizing their capabilities.

– Accenture Cyber Intelligence Platform. Analytics helping organizations to continuously predict, detect and combat cyber attacks.

–  Data Acceleration: Architecture for the Modern Data Supply Chain, Accenture

Related Posts

On Big Data and Data Science. Interview with James Kobielus. Source: ODBMS Industry Watch, 2016-04-19

On the Internet of Things. Interview with Colin Mahony. Source: ODBMS Industry Watch, 2016-03-14

A Grand Tour of Big Data. Interview with Alan Morrison. Source: ODBMS Industry Watch, 2016-02-25

On the Industrial Internet of Things. Interview with Leon Guzenda. Source: ODBMS Industry Watch, 2016-01-28

On Artificial Intelligence and Society. Interview with Oren Etzioni. Source: ODBMS Industry Watch, 2016-01-15

 

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2016/05/on-data-analytics-and-the-enterprise-interview-with-narendra-mulani/feed/ 0
On the Internet of Things. Interview with Colin Mahony http://www.odbms.org/blog/2016/03/on-the-internet-of-things-interview-with-colin-mahony/ http://www.odbms.org/blog/2016/03/on-the-internet-of-things-interview-with-colin-mahony/#comments Mon, 14 Mar 2016 08:45:56 +0000 http://www.odbms.org/blog/?p=4101

“Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data.”– Colin Mahony

I have interviewed Colin Mahony, SVP & General Manager, HPE Big Data Platform. Topics of the interview are: The challenges of the Internet of Things, the opportunities for Data Analytics, the positioning of HPE Vertica and HPE Cloud Strategy.

RVZ

Q1. Gartner says 6.4 billion connected “things” will be in use in 2016, up 30 percent from 2015.  How do you see the global Internet of Things (IoT) market developing in the next years?

Colin Mahony: As manufacturers connect more of their “things,” they have an increased need for analytics to derive insight from massive volumes of sensor or machine data. I see these manufacturers, particularly manufacturers of commodity equipment, with a need to provide more value-added services based on their ability to provide higher levels of service and overall customer satisfaction. Data analytics platforms are key to making that happen. Also, we could see entirely new analytical applications emerge, driven by what consumers want to know about their devices and by combining that data with, say, their exercise regimens, health vitals, social activities, and even driving behavior, for full personal insight.
Ultimately, the Internet of Things will drive a need for the Analyzer of Things, and that is our mission.

Q2. What challenges and opportunities does the Internet of Things (IoT) bring?

Colin Mahony: Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data. The reason? Traditional data warehouses were designed well before the Internet of Things, or, at least before OT (operational technology) like medical devices, industrial equipment, cars, and more were connected to the Internet. So, having an analytical platform to provide the scale and performance required to handle these volumes is important, but customers are taking more of a two- or three-tier approach that involves some sort of analytical processing at the edge before data is sent to an analytical data store. Apache Kafka is also becoming an important tier in this architecture, serving as a message bus, to collect and push that data from the edge in streams to the appropriate database, CRM system, or analytical platform for, as an example, correlation of fault data over months or even years to predict and prevent part failure and optimize inventory levels.
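The tiered approach described above (processing at the edge, a message bus, then an analytical store) can be sketched roughly as follows. This is only an illustration under stated assumptions: a standard-library `queue.Queue` stands in for the message bus, no Kafka client is used, and the device names, readings and helper functions are hypothetical.

```python
import json
import queue
from statistics import mean
from typing import Dict, List


def summarize(device_id: str, readings: List[float]) -> Dict[str, object]:
    """Edge tier: reduce a batch of raw readings to one compact summary."""
    return {
        "device_id": device_id,
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
    }


def publish(bus: queue.Queue, message: Dict[str, object]) -> None:
    """Bus tier: serialize the summary and hand it to the message bus stand-in."""
    bus.put(json.dumps(message))


def drain(bus: queue.Queue) -> List[Dict[str, object]]:
    """Store tier: consume everything currently on the bus for later analysis."""
    consumed = []
    while not bus.empty():
        consumed.append(json.loads(bus.get()))
    return consumed


bus = queue.Queue()  # in-memory stand-in, not a real Kafka topic
raw = {"pump-1": [70.0, 71.0, 90.0], "pump-2": [65.0, 66.0]}

for device_id, readings in raw.items():
    publish(bus, summarize(device_id, readings))

stored = drain(bus)  # two summaries reach the store instead of five raw readings
```

The design choice being illustrated is volume reduction before the data center: only the summaries cross the bus, while the raw readings stay at the edge.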

Q3. Big Data: In your opinion, what are the current main demands/needs in the market?

Colin Mahony: All organizations want – and need – to become data-driven organizations. I mean, who wants to make such critical decisions based on half answers and anecdotal data? That said, traditional companies with data stores and systems going back 30-40 years don’t have the same level playing field as the next market disruptor that just received their series B funding and only knows that analytics is the life blood of their business and all their critical decisions.
The good news is that whether you are a 100-year old insurance company or the next Uber or Facebook, you can become a data-driven organization by taking an open platform approach that uses the best tool for the job and can incorporate emerging technologies like Kafka and Spark without having to bolt on or buy all of that technology from a single vendor and get locked in.  Understanding the difference between an open platform with a rich ecosystem and open source software as one very important part of that ecosystem has been a differentiator for our customers.

Beyond technology, we have customers that establish analytical centers of excellence that actually work with the data consumers – often business analysts – that run ad-hoc queries using their preferred data visualization tool to get the insight needed for their business unit or department. If the data analysts struggle, then this center of excellence, which happens to report up through IT, collaborates with them to understand and help them get to the analytical insight – rather than simply halting the queries with no guidance on how to improve.

Q4. How do you embed analytics and why is it useful? 

Colin Mahony: OEM software vendors, particularly, see the value of embedding analytics in their commercial software products or software as a service (SaaS) offerings.  They profit by creating analytic data management features or entirely new applications that put customers on a faster path to better, data-driven decision making. Offering such analytics capabilities enables them to not only keep a larger share of their customer’s budget, but at the same time greatly improve customer satisfaction. To offer such capabilities, many embedded software providers are attempting unorthodox fixes with row-oriented OLTP databases, document stores, and Hadoop variations that were never designed for heavy analytic workloads at the volume, velocity, and variety of today’s enterprise. Alternatively, some companies are attempting to build their own big data management systems. But such custom database solutions can take thousands of hours of research and development, require specialized support and training, and may not be as adaptable to continuous enhancement as a pure-play analytics platform. Both approaches are costly and often outside the core competency of businesses that are looking to bring solutions to market quickly.

Because it’s specifically designed for analytic workloads, HPE Vertica is quite different from other commercial alternatives. Vertica differs from OLTP DBMS and proprietary appliances (which typically embed row-store DBMSs) by grouping data together on disk by column rather than by row (that is, so that the next piece of data read off disk is the next attribute in a column, not the next attribute in a row). This enables Vertica to read only the columns referenced by the query, instead of scanning the whole table as row-oriented databases must do. This speeds up query processing dramatically by reducing disk I/O.
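To make the row-versus-column point concrete, here is a toy Python sketch that counts how many values each layout has to touch to answer "sum of amount". It is a simplified model, not Vertica's storage engine; the table, column names and functions are made up for illustration.

```python
from typing import Dict, List, Tuple

COLUMNS = ("order_id", "region", "amount")

# Row layout: each tuple holds all attributes of one record together.
rows: List[Tuple[int, str, float]] = [
    (1, "north", 120.0),
    (2, "south", 80.0),
    (3, "north", 50.0),
]


def to_columns(table: List[Tuple[int, str, float]]) -> Dict[str, list]:
    """Column layout: each attribute is stored contiguously in its own list."""
    return {name: [row[i] for row in table] for i, name in enumerate(COLUMNS)}


def sum_amount_row_store(table: List[Tuple[int, str, float]]) -> Tuple[float, int]:
    """Scan whole rows; every attribute is read even though one is needed."""
    total, values_read = 0.0, 0
    for row in table:
        values_read += len(row)
        total += row[2]
    return total, values_read


def sum_amount_column_store(columns: Dict[str, list]) -> Tuple[float, int]:
    """Read only the column referenced by the query."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


row_total, row_reads = sum_amount_row_store(rows)
col_total, col_reads = sum_amount_column_store(to_columns(rows))
```

Both layouts return the same total, but the row scan touches nine values and the column scan three; on wide tables with many rows, that ratio is where the I/O saving comes from.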

You’ll find Vertica as the core analytical engine behind some popular products, including Lancope, Empirix, Good Data, and others as well as many HPE offerings like HPE Operations Analytics, HPE Application Defender, and HPE App Pulse Mobile, and more.

Q5. How do you make a decision when it is more appropriate to “consume and deploy” Big Data on premise, in the cloud, on demand and on Hadoop?

Colin Mahony: The best part is that you don’t need to choose with HPE. Unlike most emerging data warehouses as a service where your data is trapped in their databases when your priorities or IT policies change, HPE offers the most complete range of deployment and consumption models. If you want to spin up your analytical initiative on the cloud for a proof-of-concept or during the holiday shopping season for e-retailers, you can do that easily with HPE Vertica OnDemand.
If your organization finds that due to security or confidentiality or privacy concerns you need to bring your analytical initiative back in house, then you can use HPE Vertica Enterprise on-premises without losing any customizations or disrupting your business. Have petabyte volumes of largely unstructured data where the value is unknown? Use HPE Vertica for SQL on Hadoop, deployed natively on your Hadoop cluster, regardless of the distribution you have chosen. Each consumption model, available in the cloud, on-premise, on-demand, or using reference architectures for HPE servers, is available to you with that same trusted underlying core.

Q6. What is the new class of infrastructure called “composable”? Is it relevant for Big Data?

Colin Mahony: HPE believes that a new architecture is needed for Big Data – one that is designed to power innovation and value creation for the new breed of applications while running traditional workloads more efficiently.
We call this new architectural approach Composable Infrastructure. HPE has a well-established track record of infrastructure innovation and success. HPE Converged Infrastructure, software-defined management, and hyper-converged systems have consistently proven to reduce costs and increase operational efficiency by eliminating silos and freeing available compute, storage, and networking resources. Building on our converged infrastructure knowledge and experience, we have designed a new architecture that can meet the growing demands for a faster, more open, and continuous infrastructure.

Q7. What is HPE Cloud Strategy? 

Colin Mahony: Hybrid cloud adoption is continuing to grow at a rapid rate and a majority of our customers recognize that they simply can’t achieve the full measure of their business goals by consuming only one kind of cloud.
HPE Helion not only offers private cloud deployments and managed private cloud services, but we have created the HPE Helion Network, a global ecosystem of service providers, ISVs, and VARs dedicated to delivering open standards-based hybrid cloud services to enterprise customers. Through our ecosystem, our customers gain access to an expanded set of cloud services and improve their abilities to meet country-specific data regulations.

In addition to the private cloud offerings, we have a strategic and close alliance with Microsoft Azure, which enables many of our offerings, including Haven OnDemand, in the public cloud. We also work closely with Amazon because our strategy is not to limit our customers, but to ensure that they have the choices they need and the services and support they can depend upon.

Q8. What are the advantages of an offering like Vertica in this space?

Colin Mahony: More and more companies are exploring the possibility of moving their data analytics operations to the cloud. We offer HPE Vertica OnDemand, our data warehouse as a service, for organizations that need high-performance enterprise class data analytics for all of their data to make better business decisions now. Built by design to drastically improve query performance over traditional relational database systems, HPE Vertica OnDemand is engineered from the same technology that powers the HPE Vertica Analytics Platform. For organizations that want to select Amazon hardware and still maintain the control over the installation, configuration, and overall maintenance of Vertica for ultimate performance and control, we offer Vertica AMI (Amazon Machine Image). The Vertica AMI is a bring-your-own-license model that is ideal for organizations that want the same experience as on-premise installations, only without procuring and setting up hardware. Regardless of which deployment model you choose, we have you covered for “on demand” or “enterprise cloud” options.

Q9. What is HPE Vertica Community Edition?

Colin Mahony: We have had tens of thousands of downloads of the HPE Vertica Community Edition, a freemium edition of HPE Vertica with all of the core features and functionality that you experience with our core enterprise offering. It’s completely free for up to 1 TB of data storage across three nodes. Companies of all sizes prefer the Community Edition to download, install, set-up, and configure Vertica very quickly on x86 hardware or use our Amazon Machine Image (AMI) for a bring-your-own-license approach to the cloud.

Q10. Can you tell us how Kiva.org, a non-profit organization, uses on-demand cloud analytics to leverage the internet and a worldwide network of microfinance institutions to help fight poverty? 

Colin Mahony: HPE is a major supporter of Kiva.org, a non-profit organization with a mission to connect people through lending to alleviate poverty. Kiva.org uses the internet and a worldwide network of microfinance institutions to enable individuals to lend as little as $25 to help create opportunity around the world. When the opportunity arose to help support Kiva.org with an analytical platform to further the cause, we jumped at the opportunity. Kiva.org relies on Vertica OnDemand to reduce capital costs, leverage the SaaS delivery model to adapt more quickly to changing business requirements, and work with over a million lenders, hundreds of field partners and volunteers, across the world. To see a recorded Webinar with HPE and Kiva.org, see here.

Qx Anything else you wish to add?

Colin Mahony: We appreciate the opportunity to share the features and benefits of HPE Vertica as well as the bright market outlook for data-driven organizations. However, I always recommend that any organization that is struggling with how to get started with its analytics initiative speak and meet with peers to learn best practices and avoid potential pitfalls. The best way to do that, in my opinion, is to visit with the more than 1,000 Big Data experts in Boston from August 29 – September 1st at the HPE Big Data Conference. Click here to learn more and join us for 40+ technical deep-dive sessions.

————-

Colin Mahony, SVP & General Manager, HPE Big Data Platform

Colin Mahony leads the Hewlett Packard Enterprise Big Data Platform business group, which is responsible for the industry leading Vertica Advanced Analytics portfolio, the IDOL Enterprise software that provides context and analysis of unstructured data, and Haven OnDemand, a platform for developers to leverage APIs and on demand services for their applications.
In 2011, Colin joined Hewlett Packard as part of the highly successful acquisition of Vertica, and took on the responsibility of VP and General Manager for HP Vertica, where he guided the business to remarkable annual growth and recognized industry leadership. Colin brings a unique combination of technical knowledge, market intelligence, customer relationships, and strategic partnerships to one of the fastest growing and most exciting segments of HP Software.

Prior to Vertica, Colin was a Vice President at Bessemer Venture Partners focused on investments primarily in enterprise software, telecommunications, and digital media. He established a great network and reputation for assisting in the creation and ongoing operations of companies through his knowledge of technology, markets and general management in both small startups and larger companies. Prior to Bessemer, Colin worked at Lazard Technology Partners in a similar investor capacity.

Prior to his venture capital experience, Colin was a Senior Analyst at the Yankee Group serving as an industry analyst and consultant covering databases, BI, middleware, application servers and ERP systems. Colin helped build the ERP and Internet Computing Strategies practice at Yankee in the late nineties.

Colin earned an M.B.A. from Harvard Business School and a bachelor’s degree in Economics with a minor in Computer Science from Georgetown University. He is an active volunteer with Big Brothers Big Sisters of Massachusetts Bay and the Joey Fund for Cystic Fibrosis.

Resources

What’s in store for Big Data analytics in 2016, Steve Sarsfield, Hewlett Packard Enterprise. ODBMS.org, 3 FEB, 2016

What’s New in Vertica 7.2?: Apache Kafka Integration!, HPE, last edited February 2, 2016

Gartner Says 6.4 Billion Connected “Things” Will Be in Use in 2016, Up 30 Percent From 2015, Press release, November 10, 2015

The Benefits of HP Vertica for SQL on Hadoop, HPE, July 13, 2015

Uplevel Big Data Analytics with Graph in Vertica – Part 5: Putting graph to work for your business, Walter Maguire, Chief Field Technologist, HP Big Data Group, ODBMS.org, 2 Nov, 2015

HP Distributed R, ODBMS.org, 19 FEB, 2015.

Understanding ROS and WOS: A Hybrid Data Storage Model, HPE, October 7, 2015

Related Posts

On Big Data Analytics. Interview with Shilpa Lawande. Source: ODBMS Industry Watch, Published on December 10, 2015

On HP Distributed R. Interview with Walter Maguire and Indrajit Roy. Source: ODBMS Industry Watch, Published on April 9, 2015

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2016/03/on-the-internet-of-things-interview-with-colin-mahony/feed/ 0
On the Industrial Internet of Things. Interview with Leon Guzenda http://www.odbms.org/blog/2016/01/on-the-industrial-internet-of-things-interview-with-leon-guzenda/ http://www.odbms.org/blog/2016/01/on-the-industrial-internet-of-things-interview-with-leon-guzenda/#comments Thu, 28 Jan 2016 10:40:16 +0000 http://www.odbms.org/blog/?p=4066

“Apart from security, the single biggest new challenges that the Industrial Internet of Things poses are the number of devices involved, the rate that many of them can generate data and the database and analytical requirements.” –Leon Guzenda.

I have interviewed Leon Guzenda, Chief Technical Marketing Officer at Objectivity. Topics of the interview are data analytics, the Industrial Internet of Things (IIoT), and ThingSpan.

RVZ

Q1. What is the difference between Big Data and Fast Data?

Leon Guzenda: Big Data is a generic term for datasets that are too large or complex to be handled with traditional technology. Fast Data refers to streams of data that must be processed or acted upon immediately once received.
If most, or all, of it is stored, it will probably end up as Big Data. Hadoop standardized the parallel processing approach for Big Data, and HDFS provided a resilient storage infrastructure. Meanwhile, Complex Event Processing became the main way of dealing with fast-moving streams of data, applying business logic and triggering event processing. Spark is a major step forward in controlling workflows that have streaming, batch and interactive elements, but it only offers a fairly primitive way to bridge the gap between the Fast and Big Data worlds via tabular RDDs or DataFrames.

ThingSpan, Objectivity’s new information fusion platform, goes beyond that. It integrates with Spark Streaming and HDFS to provide a dynamic Metadata Store that holds information about the many complex relationships between the objects in the Hadoop repository or elsewhere. It can be used to guide data mining using Spark SQL or GraphX and analytics using Spark MLlib.

Q2. Shawn Rogers, Chief Research Officer, Dell Statistica recently said in an interview: “A ‘citizen data scientist’ is an everyday, non-technical user that lacks the statistical and analytical prowess of a traditional data scientist, but is equally eager to leverage data in order to uncover insights, and importantly, do so at the speed of business”. What is your take on this?

Leon Guzenda:  It’s a bit like the difference between amateur and professional astronomers.
There are far more data users than trained data scientists, and it’s important that the data users have all of the tools needed to extract value from their data. Things filter down from the professionals to the occasional users. I’ve heard the term “NoHow” applied to tools that make this possible. In other words, the users don’t have to understand the intricacy of the algorithms. They only need to apply them and interpret the results. We’re a long way from that with most kinds of data, but there is a lot of work in this area.

We are making advances in visual analytics, but there is also a large and rapidly growing set of algorithms that the tool builders need to make available. Users should be able to define their data sources, say roughly what they’re looking for and let the tool assemble the workflow and visualizers. We like the idea of “Citizen Data Scientists” being able to extract value from their data more efficiently, but let’s not forget that data blending at the front end is still a challenge and may need some expert help.

That’s another reason why the ThingSpan Metadata Store is important. An expert can describe the data there in terms that are familiar to the user. Applying the wrong analytical algorithm can produce false patterns, particularly when the data has been sampled inadequately. Once again, having an expert constrain the use of particular algorithms to certain types of data can make it much more likely that the Citizen Data Scientists will obtain useful results.

Q3. Do we really need the Internet of Things?

Leon Guzenda: That’s a good question. It’s only worth inventing a category if the things that it applies to are sufficiently different from other categories to merit it. If we think of the Internet as a network of connected networks that share the same protocol, then it isn’t necessary to define exactly what each node is. The earliest activities on the Internet were messaging, email and file sharing. The WWW made it possible to set up client-server systems that ran over the Internet. We soon had “push” systems that streamed messages to subscribers rather than having them visit a site and read them. One of the fastest growing uses is the streaming of audio and video. We still haven’t overcome some of the major issues associated with the Internet, notably security, but we’ve come a long way.

Around the turn of the century it became clear that there are real advantages in connecting a wider variety of devices directly to each other in order to improve their effectiveness or that of an overall system. Separate areas of study, such as smart power grids, cities and homes, each came to the conclusion that new protocols were needed if there were no humans tightly coupled to the loop. Those efforts are now converging into the discipline that we call the Internet of Things (IoT), though you only have to walk the exhibitor hall at any IoT conference to find that we’re at about the same point as we were in the early NoSQL conferences. Some companies have been tackling the problems for many years whilst others are trying to bring value by making it easier to handle connectivity, configuration, security, monitoring, etc.

The Industrial IoT (IIoT) is vital, because it can help improve our quality of life and safety whilst increasing the efficiency of the systems that serve us. The IIoT is a great opportunity for some of the database vendors, such as Objectivity, because we’ve been involved with companies or projects tackling these issues for a couple of decades, notably in telecoms, process control, sensor data fusion, and intelligence analysis. New IoT systems generally need to store data somewhere and make it easy to analyze. That’s what we’re focused on, and why we decided to build ThingSpan, to leverage our existing technology with new open source components to enable real-time relationship and pattern discovery of IIoT applications.

Q4. What is special about the Industrial Internet of Things? And what are the challenges and opportunities in this area?

Leon Guzenda: Apart from security, the single biggest new challenges that the IIoT poses are the number of devices involved, the rate at which many of them can generate data and the database and analytical requirements. The number of humans on the planet is heading towards eight billion, but not all of them have Internet access. The UN expects that there will be around 11 billion of us by 2100. There are likely to be around 25 billion IIoT devices by 2020.

There is growing recognition among organizations that they need to make better use of their sensor-based data to gain competitive advantage. According to McKinsey & Co., organizations in many industry segments are currently using less than 5% of data from their sensors. Better utilization of sensor-based data could lead to a positive impact of up to $11.1 trillion per year by 2025 through improved productivity.

Q5. Could you give us some examples of predictive maintenance and asset management within the Industrial IoT?

Leon Guzenda:  Yes, neither use case is new nor directly the result of the IIoT, but the IIoT makes it easier to collect, aggregate and act upon information gathered from devices. We have customers building telecom, process control and smart building management systems that aggregate information from multiple customers in order to make better predictions about when equipment should be tweaked or maintained.

One of our customers provides systems for conducting seismic surveys for oil and gas companies and for helping them maximize the yield from the resources that they discover. A single borehole can have 10,000 sensors in the equipment at the site.
That’s a lot of data to process in order to maintain control of the operation and avoid problems. Replacing a broken drill bit can take one to three days, with the downtime costing between $1 million and $3.5 million. Predictive maintenance can be used to schedule timely replacement or servicing of the drill bit, reducing the downtime to three hours or so.
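The scale of the saving can be checked with a few lines of arithmetic. The sketch below is only a rough illustration: it assumes that downtime cost scales linearly with hours lost, which the interview does not state, and all function and variable names are hypothetical.

```python
# Rough arithmetic on the drill-bit figures quoted above.
# ASSUMPTION (not from the interview): cost scales linearly with downtime.

HOURS_PER_DAY = 24


def hourly_cost(total_cost_usd: float, downtime_days: float) -> float:
    """Implied cost per hour of downtime under the linear assumption."""
    return total_cost_usd / (downtime_days * HOURS_PER_DAY)


# Quoted range: one to three days of downtime costing $1M to $3.5M.
LOW_RATE = hourly_cost(1_000_000, 1)
HIGH_RATE = hourly_cost(3_500_000, 3)

# Quoted target: roughly three hours of planned downtime.
PLANNED_HOURS = 3
PLANNED_LOW = LOW_RATE * PLANNED_HOURS
PLANNED_HIGH = HIGH_RATE * PLANNED_HOURS

print(f"Implied hourly cost: ${LOW_RATE:,.0f} to ${HIGH_RATE:,.0f}")
print(f"Cost of a 3-hour planned stop: ${PLANNED_LOW:,.0f} to ${PLANNED_HIGH:,.0f}")
```

Under that linear assumption, a three-hour planned stop would cost roughly $125,000 to $146,000, compared with the $1 million to $3.5 million quoted for an unplanned failure.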

There are similar case studies across industries. The CEO of one of the world’s largest package transportation companies said recently that saving a single mile off of every driver’s route resulted in savings of $50 million per year! Airlines also use predictive maintenance to service engines and other aircraft parts to keep passengers safely in the air, and mining companies use GPS tracking beacons on all of their assets to schedule the servicing of vital and very costly equipment optimally. Prevention is much better than treatment when it comes to massive or expensive equipment.

Q6. What is ThingSpan? How is it positioned in the market?

Leon Guzenda: ThingSpan is an information fusion software platform, architected for performance and extensibility, to accelerate time-to-production of IoT applications. ThingSpan is designed to sit between streaming analytics platforms and Big Data platforms in the Fast Data pipeline to create contextual information in the form of transformed data and domain metadata from streaming data and static, historical data. Its main differentiators from other tools in the field are its abilities to handle concurrent high volume ingest and pathfinding query loads.

ThingSpan is built around object-data management technology that is battle-tested in data fusion solutions in production use with U.S. government and Fortune 1000 organizations. It provides out-of-the-box integration with Spark and Hadoop 2.0 as well as other major open source technologies. Objectivity has been bridging the gap between Big Data and Fast Data within the IIoT for leading government agencies and commercial enterprises for decades, in industries such as manufacturing, oil and gas, utilities, logistics and transportation, and telecommunications. Our software is embedded as a key component in several custom IIoT applications, such as management of real-time sensor data, security solutions, and smart grid management.

Q7. Graphs are hard to scale. How do you handle this in ThingSpan?

Leon Guzenda: ThingSpan is based on our scalable, high-performance, distributed object database technology. ThingSpan isn’t constrained to graphs that can be handled in memory, nor is it dependent upon messaging between vertices in the graph. The address space could be easily expanded to the Yottabyte range or beyond, so we don’t expect any scalability issues. The underlying kernel handles difficult tasks, such as pathfinding between nodes, so performance is high and predictable. Supplementing ThingSpan’s database capabilities with the algorithms available via Spark GraphX makes it possible for users to handle a much broader range of tasks.

We’ve also noted over the years that most graphs aren’t as randomly connected as you might expect. We often see clusters of subgraphs, or dandelion-like structures, that we can use to optimize the physical placement of portions of the graph on disk. Having said that, we’ve also done a lot of work to reduce the impact of supernodes (ones with extremely large numbers of connections) and to speed up pathfinding in the cases where physical clustering doesn’t work.

Q8. Could you describe how ThingSpan’s graph capabilities can be beneficial for use cases, such as cybersecurity, fraud detection and anti-money laundering in financial services, to name a few?

Leon Guzenda: Each of those use cases, particularly cybersecurity, deals with fast-moving streams of data, which can be analyzed by checking thresholds in individual pieces of data or accumulated statistics. ThingSpan can be used to correlate the incoming (“Fast”) data that is handled by Spark Streaming with a graph of connections between devices, people or institutions. At that point, you can recognize Denial of Service attacks, fraudulent transactions or money laundering networks, all of which will involve nodes representing suspicious people or organizations.
The faster you can do this, the more chance you have of containing a cybersecurity threat or preventing financial crimes.
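The idea of correlating incoming events with a graph of connections can be sketched in a few lines. This is a generic illustration only, not ThingSpan's or Spark's API: the account names, the `max_hops` parameter and the in-memory adjacency dictionary are all hypothetical, and a production system would query a distributed graph store instead.

```python
from collections import deque


def within_hops(graph, start, targets, max_hops):
    """Breadth-first search: is any target reachable from start in <= max_hops edges?"""
    if start in targets:
        return True
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour in targets:
                return True
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return False


def flag_events(events, graph, suspicious, max_hops=2):
    """Keep events where either party is close to a known suspicious node."""
    return [
        event
        for event in events
        if any(
            within_hops(graph, party, suspicious, max_hops)
            for party in (event["src"], event["dst"])
        )
    ]


# Hypothetical relationship graph (undirected, stored as adjacency sets).
GRAPH = {
    "acct_a": {"acct_b"},
    "acct_b": {"acct_a", "shell_co"},
    "shell_co": {"acct_b"},
    "acct_c": {"acct_d"},
    "acct_d": {"acct_c"},
}
SUSPICIOUS = {"shell_co"}

# Hypothetical stream of incoming ("Fast") events.
EVENTS = [
    {"src": "acct_a", "dst": "acct_c", "amount": 900},
    {"src": "acct_c", "dst": "acct_d", "amount": 50},
]

FLAGGED = flag_events(EVENTS, GRAPH, SUSPICIOUS)
print(FLAGGED)
```

Here the first event is flagged because `acct_a` is two hops from `shell_co`, while the second involves only accounts with no path to a suspicious node.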

Q9. Objectivity has traditionally focused on a relatively narrow range of verticals. How do you intend to support a much broader range of markets than your current base?

Leon Guzenda:  Our base has evolved over the years and the number of markets has expanded since the industry’s adoption of Java and widespread acceptance of NoSQL technology. We’ve traditionally maintained a highly focused engineering team and very responsive product support teams at our headquarters and out in the field. We have never attempted to be like Microsoft or Apple, with huge teams of customer service people handling thousands of calls per day. We’ve worked with VARs that embed our products in their equipment or with system integrators that build highly complex systems for their government and industry customers.

We’re expanding this approach with ThingSpan by working with the open source community, as well as building partnerships with technology and service providers. We don’t believe that it’s feasible or necessary to suddenly acquire expertise in a rapidly growing range of disciplines and verticals. We’re happy to hand much of the service work over to partners with the right domain expertise while we focus on strengthening our technologies. We recently announced a technology partnership with Intel via their Trusted Analytics Platform (TAP) initiative. We’ll soon be announcing certification by key technology partners and the completion of major proof of concept ThingSpan projects. Each of us will handle a part of a specific project, supporting our own products or providing expertise and working together to improve our offerings.

———
Leon Guzenda, Chief Technical Marketing Officer at Objectivity
Leon Guzenda was one of the founding members of Objectivity in 1988 and one of the original architects of Objectivity/DB.
He currently works with Objectivity’s major customers to help them effectively develop and deploy complex applications and systems that use the industry’s highest-performing, most reliable DBMS technology, Objectivity/DB. He also liaises with technology partners and industry groups to help ensure that Objectivity/DB remains at the forefront of database and distributed computing technology.
Leon has more than five decades of experience in the software industry. At Automation Technology Products, he managed the development of the ODBMS for the Cimplex solid modeling and numerical control system.
Before that, he was Principal Project Director for International Computers Ltd. in the United Kingdom, delivering major projects for NATO and leading multinationals. He was also design and development manager for ICL’s 2900 IDMS database product. He spent the first 7 years of his career working in defense and government systems. Leon has a B.S. degree in Electronic Engineering from the University of Wales.

Resources

What is data blending. By Oleg Roderick, David Sanchez, Geisinger Data Science, ODBMS.org, November 2015

- Industrial Internet of Things: Unleashing the Potential of Connected Products and Services. World Economic Forum. January 2015

Related Posts

Can Columnar Database Systems Help Mathematical Analytics? by Carlos Ordonez, Department of Computer Science, University of Houston. ODBMS.org, 23 JAN, 2016.

The Managers Who Stare at Graphs. By Christopher Surdak, JD. ODBMS.org, 23 SEP, 2015.

From Classical Analytics to Big Data Analytics. by Peter Weidl, IT-Architect, Zürcher Kantonalbank. ODBMS.org, 11 AUG, 2015

Streamlining the Big Data Landscape: Real World Network Security Usecase. By Sonali Parthasarathy Accenture Technology Labs. ODBMS.org, 2 JUL, 2015.

Follow ODBMS.org on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2016/01/on-the-industrial-internet-of-things-interview-with-leon-guzenda/feed/ 0
On Big Data and Analytics. Interview with John K. Thompson. http://www.odbms.org/blog/2015/10/on-big-data-and-analytics-interview-with-john-k-thompson/ http://www.odbms.org/blog/2015/10/on-big-data-and-analytics-interview-with-john-k-thompson/#comments Tue, 27 Oct 2015 21:46:37 +0000 http://www.odbms.org/blog/?p=4014

“While it’s hard to pinpoint all of the key challenges for organizations hoping to effectively deploy their own predictive models, one significant challenge we’ve observed is the lack of C-level buy in.”–John K. Thompson

I have interviewed John K. Thompson, general manager of global advanced analytics at Dell Software. We discussed the top pieces of Big Data and Analytics news coming out of Dell World 2015.

RVZ

Q1. What are the key challenges for organizations to effectively deploy predictive models?

John: While it’s hard to pinpoint all of the key challenges for organizations hoping to effectively deploy their own predictive models, one significant challenge we’ve observed is the lack of C-level buy-in. One direct example of this was Dell’s recent internal data migration from a legacy platform to its own platform, Statistica. It required major cultural change, involving identifying key change agents among Dell’s executive and senior management teams, who were responsible for enforcing governance as needed. On a technical level, Dell Statistica contains the most sophisticated algorithms for predictive analytics, machine learning and statistical analysis, enabling companies to find meaningful patterns in data. With 44 percent of organizations still not understanding how to extract value from their data, as revealed in Dell’s Global Technology Adoption Index 2015, Dell helps businesses invest wisely in data technologies, such as Statistica, to leverage the power of predictive analytics.

Q2. What is the role of users in running data analytics?  

John: End-users turn to data analytics to better understand their businesses, predict change, increase agility and control critical systems through data. Customers use Statistica for predictive modeling, visualizations, text mining and data mining. With Statistica 13’s NDA capabilities, organizations can save time and resources by allowing the analytic processing to take place in the database or Hadoop cluster, rather than pulling data to a server or desktop. With features such as these, businesses can spend more time analyzing and making decisions from their data vs. processing the information.

Q3. What are the key challenges for organizations to embed analytics across core processes? 

John: Embedding analytics across an organization’s core processes helps offer analytics to more users and allows it to become more universally accepted throughout the business. One of the largest challenges of embedding analytics is the attempt to analyze unorganized datasets. This can lead to miscategorization of the data, which can eventually result in making inaccurate business decisions. At Dell’s annual conference, Dell World, on October 20, we announced new offerings and capabilities that enable companies to embed analytics across their core processes and disseminate analytics expertise to give scalability to data-based decision making.

Q4. How is analytics related to the Internet of Things?

John: Data analytics and the Internet of Things go hand in hand. In the modern data economy, the ability to gain predictive insight from all data is critical to building an agile, connected and thriving data-driven enterprise. Whether the data comes from real-time sensors from an IoT environment, or a big data platform designed for analytics on massive amounts of disparate data, our new offerings enable detailed levels of insight and action. With the new capabilities and enhancements delivered in Statistica 13, Dell is making it possible for organizations of all sizes to deploy predictive analytics across the enterprise and beyond in a smart, simple and cost-effective manner. We believe this ultimately empowers them to better understand customers, optimize business processes, and create new products and services.

Q5. On big data and analytics Dell has announced new offerings to its end-to-end big data and analytics portfolio. What are these new offerings?

John: Dell is announcing a series of new big data and analytics solutions and services designed to help companies quickly and securely turn data into insights for better, faster decision-making. Statistica 13, the newest version of our advanced analytics software, makes it easier for organizations to deploy predictive models across the enterprise to reveal business and customer insights. Dell Services’ Analytics-as-a-Service offerings target specific industries, including banking and insurance, to provide actionable information, and better understand customers and business processes. Overall, with these enhancements, Dell is making it easier for organizations to understand how to invest in big data technologies and leverage the power of predictive analytics.

Q6. Dell is not a software company. How do you help customers turn data into insights for better decision making?

John: Dell has made great strides in the software industry, and specifically, the big data and analytics space, since our 2014 acquisition of StatSoft. Both Statistica 13 and Dell’s expanded Analytics-as-a-Service offerings help customers better unearth insights, predict business outcomes, and improve accuracy and efficiency of critical business processes. For example, the new analytics-enabled Business Process Outsourcing (BPO) services help organizations deal with fraud, denial likelihood scoring and customer retention. Additionally, the Dell ModelHealth Tracker helps customers track and monitor the effectiveness of their various predictive analytics models, leading to better business decision-making at every level.

Q7. What are the main advancements to Dell’s analytics platform that you have introduced? And why?

John: The launch of Statistica 13 helps simplify the way organizations of all sizes deploy predictive models directly to data sources inside the firewall, in the cloud and in partner ecosystems. Additionally, Statistica 13 requires no coding and integrates seamlessly with open source R, which helps organizations leverage all data to predict future trends, identify new customers and sales opportunities, explore “what-if” scenarios, and reduce the occurrence of fraud and other business risks. The full list of enhancements includes:

  • A modernized GUI for greater ease-of-use and visual appeal
  • More integration with the recently added Statistica Interactive Visualization and Dashboard engine
  • More integration with open source R allowing for more control of R scripts
  • A new stepwise model tool that gradually recommends optimum models for users
  • New Native Distributed Analytics (NDA) capabilities that allow users to run analytics directly in the database where data lives and work more efficiently with large and growing data sets
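The general idea behind running analytics where the data lives, rather than pulling rows to a client, can be illustrated with a small sketch. This is not Statistica's NDA interface; it uses Python's standard `sqlite3` module and a hypothetical `readings` table purely to contrast the two approaches.

```python
import sqlite3

# Hypothetical in-memory table standing in for a large data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# Approach 1: pull every row to the client, then aggregate there.
rows = conn.execute("SELECT sensor, value FROM readings").fetchall()
totals = {}
for sensor, value in rows:
    running_sum, count = totals.get(sensor, (0.0, 0))
    totals[sensor] = (running_sum + value, count + 1)
PULLED = {sensor: s / n for sensor, (s, n) in totals.items()}

# Approach 2: push the aggregation into the database and
# transfer only one summary row per sensor.
PUSHED = dict(
    conn.execute("SELECT sensor, AVG(value) FROM readings GROUP BY sensor").fetchall()
)

print(PULLED)
print(PUSHED)
conn.close()
```

Both approaches give the same per-sensor averages, but the second moves only the summary across the connection, which is the saving that in-database processing aims for on large data sets.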

Q8. Why did you introduce a new package of analytics-as-a-service offerings for industry verticals?

John: We’re announcing new analytics-as-a-service offerings in the healthcare and financial industries as those are two areas in which we’re seeing not only extreme growth, but an increased willingness and appetite for leveraging predictive analytics. These new services include:

  • Fraud, Waste and Abuse Management: Allows businesses to better identify medical identity theft, unnecessary diagnostic services or medically unnecessary services and incorrect billing.
  • Denial Likelihood Scoring and Predictive Analytics: Allows businesses to proactively identify which claims are most likely to be denied while providing at-a-glance activity data on each account. This can help eliminate up to 40 percent of low- or no-value follow-up work.
  • Churn Management/Customer Retention Services: Allows businesses to leverage predictive churn modelling. This helps users identify customers they are at risk of losing and proactively take preventative measures.

Q9. Dell has launched a new purpose-built IoT gateway series with analytics capabilities. What is it and what is it useful for? 

John: The new Dell Edge Gateway 5000 Series is a solution purpose-built for Industrial IoT. Combined with Statistica, the solution promises to give companies an edge computing alternative to today’s costly and proprietary IoT offerings. Thanks to new capabilities in Statistica 13, Dell is now expanding analytics to the gateway, allowing companies to extend the benefits of cloud computing to their network edge. In turn, this allows for more secure business insights, and saves companies the costly transfer of data to and from the cloud.

Q10. Anything else you wish to add?

John: If you’d like to hear more about what’s coming from Dell Software at Dell World 2015, check our Twitter feed at @DellSoftware for real-time updates.
—————————————–
 John K. Thompson

John K. Thompson is the general manager of global advanced analytics at Dell Software. John has 25 years of experience in building and growing technology companies in the information management segment. He has developed and executed plans for overall sales and marketing, product development and market entry. His focus areas are big data, descriptive & predictive analytics, cognitive computing, and data mining. John holds a BS in Computer Science from Ferris State University and an MBA in Marketing from DePaul University.

Resources

Dell Study Reveals Companies Investing in Cloud, Mobility, Security and Big Data Are Growing More Than 50 Percent Faster Than Laggards, Dell Press release, 13 Oct 2015.

Related Posts

Thirst for Advanced Analytics Driving Increased Need for Collective Intelligence. By John K. Thompson, General Manager, Advanced Analytics, Dell Software. ODBMS.org, August 2015

Agility – the Key to Driving Analytics Initiatives Forward. By John K. Thompson, General Manager, Advanced Analytics, Dell Software, ODBMS.org, February 2015

Challenges and Opportunities of The Internet of Things. Interview with Steve Cellini. ODBMS Industry Watch, October 7, 2015

Big Data, Analytics, and the Internet of Things. By Mohak Shah, analytics leader and research scientist at Bosch Research, USA

SMART DATA: Running the Internet of Things as a Citizen Web. by Dirk Helbing, ETH Zurich

Who Invented Big Data (and Why Should We Care)? By Shomit Ghose, General Partner, ONSET Ventures

Follow ODBMS.org on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2015/10/on-big-data-and-analytics-interview-with-john-k-thompson/feed/ 0
On Fraud Analytics and Fraud Detection. Interview with Bart Baesens http://www.odbms.org/blog/2015/09/on-fraud-analytics-and-fraud-detection-interview-with-bart-baesens/ http://www.odbms.org/blog/2015/09/on-fraud-analytics-and-fraud-detection-interview-with-bart-baesens/#comments Fri, 04 Sep 2015 04:56:37 +0000 http://www.odbms.org/blog/?p=3997

“Many companies don’t use analytical fraud detection techniques yet. In fact, most still rely on an expert based approach, meaning that they build upon the experience, intuition and business knowledge of the fraud analyst.” –Bart Baesens

On the topics of Fraud Analytics and Fraud Detection I have interviewed Bart Baesens, professor at KU Leuven (Belgium), and lecturer at the University of Southampton (United Kingdom).

RVZ

Q1. What exactly is Fraud Analytics?

Good question! First of all, in our book we define fraud as an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types of forms. The idea of using analytics for fraud detection is catalyzed by the enormous amount of data which is currently being generated in any business process. Think about insurance claim handling, credit card transactions, cash transfers, tax payments, etc. to name a few. In our book, we discuss various ways of analyzing these massive data sets in a descriptive, predictive or social network way to come up with new analytical fraud detection models.

Q2. What are the main challenges in Fraud Analytics? 

The definition we gave above highlights the 5 key challenges in fraud analytics. The first one concerns the fact that fraud is uncommon. Independent of the exact setting or application, only a minority of the involved population of cases typically concerns fraud, of which furthermore only a limited number will be known to concern fraud. This seriously complicates the estimation of analytical models.

Fraudsters try to blend into the environment and not behave differently from others in order not to get noticed and to remain covered by non-fraudsters. This effectively makes fraud imperceptibly concealed, since fraudsters do succeed in hiding by carefully considering and planning precisely how to commit fraud.

Fraud detection systems improve and learn by example. Therefore the techniques and tricks fraudsters adopt evolve in time along with, or better ahead of fraud detection mechanisms. This cat and mouse play between fraudsters and fraud fighters may seem to be an endless game, yet there is no alternative solution so far. By adopting and developing advanced analytical fraud detection and prevention mechanisms, organizations do manage to reduce losses due to fraud since fraudsters, like other criminals, tend to look for the easy way and will look for other, easier opportunities.

Fraud is typically a carefully organized crime, meaning that fraudsters often do not operate independently, have allies, and may induce copycats. Moreover, several fraud types such as money laundering and carousel fraud involve complex structures that are set up in order to commit fraud in an organized manner. This means that fraud is not an isolated event, and as such in order to detect fraud the context (e.g., the social network of fraudsters) should be taken into account. This is also extensively discussed in our book.

A final element in the description of fraud provided in our book indicates the many different types of forms in which fraud occurs. This refers both to the wide set of techniques and approaches used by fraudsters and to the many different settings in which fraud occurs or economic activities that are susceptible to fraud.

Q3. What is the current state of the art in ensuring early detection in order to mitigate fraud damage?

Many companies don’t use analytical fraud detection techniques yet. In fact, most still rely on an expert-based approach, meaning that they build upon the experience, intuition and business knowledge of the fraud analyst. Such an expert-based approach typically involves a manual investigation of a suspicious case, which may have been signaled for instance by a customer complaining of being charged for transactions he did not make. Such a disputed transaction may indicate that a new fraud mechanism has been discovered or developed by fraudsters, and therefore requires a detailed investigation for the organization to understand and subsequently address the new mechanism.

Comprehension of the fraud mechanism or pattern allows extending the fraud detection and prevention mechanism which is often implemented as a rule base or engine, meaning in the form of a set of IF-THEN rules, by adding rules that describe the newly detected fraud mechanism. These rules, together with rules describing previously detected fraud patterns, are applied to future cases or transactions and trigger an alert or signal when fraud is or may be committed by use of this mechanism. A simple, yet possibly very effective example of a fraud detection rule in an insurance claim fraud setting goes as follows:

IF:

  • Amount of claim is above threshold OR
  • Severe accident, but no police report OR
  • Severe injury, but no doctor report OR
  • Claimant has multiple versions of the accident OR
  • Multiple receipts submitted

THEN:

  • Flag claim as suspicious AND
  • Alert fraud investigation officer
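
The rule above can be sketched in a few lines of Python. This is only an illustration: the `Claim` fields, the `AMOUNT_THRESHOLD` value and the action names are hypothetical and do not come from the book or from any particular rule engine.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    amount: float
    severe_accident: bool
    has_police_report: bool
    severe_injury: bool
    has_doctor_report: bool
    accident_versions: int   # number of differing accounts given by the claimant
    receipts_submitted: int


# Hypothetical threshold; a real rule base would tune this per product line.
AMOUNT_THRESHOLD = 10_000.0


def is_suspicious(claim: Claim, threshold: float = AMOUNT_THRESHOLD) -> bool:
    """IF part: True when at least one of the OR-ed conditions holds."""
    return (
        claim.amount > threshold
        or (claim.severe_accident and not claim.has_police_report)
        or (claim.severe_injury and not claim.has_doctor_report)
        or claim.accident_versions > 1
        or claim.receipts_submitted > 1
    )


def handle(claim: Claim) -> list[str]:
    """THEN part: return both actions for a suspicious claim, else nothing."""
    if is_suspicious(claim):
        return ["flag_claim_as_suspicious", "alert_fraud_investigation_officer"]
    return []
```

Every extra rule of this kind is another hand-written condition that has to be maintained, which is the cost discussed next.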

Such an expert approach suffers from a number of disadvantages. Rule bases or engines are typically expensive to build, since they require advanced manual input by the fraud experts, and often turn out to be difficult to maintain and manage. Rules have to be kept up to date and should only, or mostly, trigger on real fraudulent cases, since every signaled case requires human follow-up and investigation. Therefore the main challenge concerns keeping the rule base lean and effective, in other words deciding upon when and which rules to add, remove, update, or merge.

By using data-driven analytical models such as descriptive, predictive or social network analytics in a complementary way, we can improve the performance of our fraud detection approaches in terms of precision, cost efficiency and operational effectiveness.

Q4. Is early detection all that can be done? Are there any other advanced techniques that can be used?

You can do more than just detection. More specifically, two components that are essential parts of almost any effective strategy to fight fraud concern fraud detection and fraud prevention. Fraud detection refers to the ability to recognize or discover fraudulent activities, whereas fraud prevention refers to measures that can be taken to avoid or reduce fraud. The difference between the two is clear-cut: the former is an ex post approach whereas the latter is an ex ante approach. Both tools may and likely should be used in a complementary manner to pursue the shared objective, being fraud reduction. However, as also discussed in our book, preventive actions will change fraud strategies and consequently impact detection power. Installing a detection system will cause fraudsters to adapt and change their behavior, and so the detection system itself will eventually impair its own detection power. So although complementary, fraud detection and prevention are not independent and therefore should be aligned and considered as a whole.

Q5. How do you examine fraud patterns in historical data? 

You can examine it in two possible ways: descriptive or predictive. Descriptive analytics or unsupervised learning aims at finding unusual anomalous behavior deviating from the average behavior or norm. This norm can be defined in various ways. It can be defined as the behavior of the average customer at a snapshot in time, or as the average behavior of a given customer across a particular time period, or as a combination of both. Predictive analytics or supervised learning assumes the availability of a historical data set with known fraudulent transactions. The analytical models built can thus only detect fraud patterns as they occurred in the past. Consequently, it will be impossible to detect previously unknown fraud. Predictive analytics can however also be useful to help explain the anomalies found by descriptive analytics.
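
As a minimal sketch of the descriptive idea, the snippet below takes the norm to be the average behavior of one customer over a past period and flags new transaction amounts that deviate strongly from it. The z-score measure, the `cutoff` of 3 and the function name are assumptions chosen for illustration, not a method taken from the book.

```python
from statistics import mean, stdev


def zscore_outliers(history, new_amounts, cutoff=3.0):
    """Return the new amounts lying more than `cutoff` sample standard
    deviations away from the mean of the customer's own history."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two historical values
    if sigma == 0:
        # Degenerate history: any amount that differs from it is unusual.
        return [x for x in new_amounts if x != mu]
    return [x for x in new_amounts if abs(x - mu) / sigma > cutoff]


# One customer's past transaction amounts, then two new transactions.
past = [20, 22, 19, 21, 18, 20, 23, 21]
flagged = zscore_outliers(past, [22, 250])
```

No fraud labels are needed here, which is what separates this from the predictive setting: the flagged transaction is merely unusual, and an analyst still has to decide whether it is fraud.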

Q6. How do you typically utilize labeled, unlabeled, and networked data for fraud detection?

Labeled observations or transactions can be analyzed using predictive analytics. Popular techniques here are linear/logistic regression, neural networks and ensemble methods such as random forests. These techniques can be used to predict both fraud incidence, which is a classification problem, as well as fraud intensity, which is a classical regression problem. Unlabeled data can be investigated using descriptive analytics. As said, the aim here is to detect anomalies deviating from the norm. Popular techniques here are: break point analysis, peer group analysis, association rules and clustering. Networked data can be analyzed using social network techniques. We found those to be very useful in our research. Popular techniques here are community detection and featurization. In our research, we developed GOTCHA!, a supervised social network learner for fraud detection. This is also extensively discussed in our book.

Q6. Fraud techniques change over time. How do you handle this? 

Good point! A key challenge concerns the dynamic nature of fraud. Fraudsters constantly try to outsmart detection and prevention systems by developing new strategies and methods. Therefore adaptive analytical models and detection and prevention systems are required, in order to detect and resolve fraud as soon as possible. Detecting fraud as early as possible is crucial. Hence, we also discuss how to continuously backtest analytical fraud detection models. The key idea here is to verify whether the fraud model still performs satisfactorily. Changing fraud tactics create concept drift, implying that the relationship between the target fraud indicator and the data available changes on an ongoing basis. Hence, it is important to closely follow up on the performance of the analytical model such that concept drift and any related performance deviation can be detected in a timely way. Depending upon the type of model and its purpose (e.g. descriptive or predictive), various backtesting activities can be undertaken. Examples are backtesting data stability, model stability and model calibration.
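
One way to make "backtesting data stability" concrete is a population stability index (PSI), which compares how observations are spread over the same bins at model-building time and today. The choice of PSI, the function name and the `eps` guard are assumptions made for this sketch; the interview does not name a specific metric.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI = sum over bins of (a - e) * ln(a / e), where e and a are the
    fractions of observations per bin in the reference and current data."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both samples must use the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # eps avoids log(0) for empty bins
        a_pct = max(a / a_total, eps)
        psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return psi


# Transaction counts per amount band: training period vs. latest month.
training = [50, 30, 20]
latest = [20, 30, 50]
drift = population_stability_index(training, latest)
```

A value of zero means the two distributions coincide. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but any threshold should be validated on the data at hand.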

Q7. What are the synergies between Fraud Analytics and CyberSecurity?

Fraud analytics creates both opportunities and threats for cybersecurity. Think about intrusion detection as an example. Predictive methods can be adopted to study known intrusion patterns, whereas descriptive methods or anomaly detection can identify emerging cyber threats. The emergence of the Internet of Things (IoT) will certainly increase the importance of fraud analytics for cybersecurity. Some examples of new fraud threats are:

  • Fraudsters might force access to web configurable devices (e.g. Automated Teller Machines (ATMs)) and set up fraudulent transactions;
  • Device hacking whereby fraudsters change operational parameters of connected devices (e.g. smart meters are manipulated to make them under-register actual usage);
  • Denial of Service (DoS) attacks whereby fraudsters massively attack a connected device to stop it from functioning;
  • Data breach whereby a user’s login information is obtained in a malicious way, resulting in identity theft;
  • Gadget fraud also referred to as gadget lust whereby fraudsters file fraudulent claims to either obtain a new gadget or free upgrade;
  • Cyber espionage whereby exchanged data is eavesdropped by an intelligence agency or used by a company for commercial purposes.

More than ever before, fraud will be dynamic and continuously changing in an IoT context. From an analytical perspective, this implies that predictive techniques will continuously lag behind since they are based on a historical data set with known fraud patterns. Hence, as soon as the predictive model has been estimated, it will become outdated even before it has been put into production. Descriptive methods such as anomaly detection, peer group and break point analysis will gain in importance. These methods should be capable of analyzing evolving data streams and perform incremental learning to deal with concept drift. To facilitate (near) real-time fraud detection, the data and algorithms should be processed in-memory instead of relying on slow secondary storage. Furthermore, based upon the results of these analytical models, it should be possible to take fully automated actions such as the shutdown of a smart meter or ATM.

Qx Anything else you wish to add?

We are happy to refer to our book for more information. We also value your opinion and look forward to receiving any feedback (both positive and negative)!

——–

Professor Bart Baesens is a professor at KU Leuven (Belgium), and a lecturer at the University of Southampton (United Kingdom). He has done extensive research on big data & analytics, customer relationship management, web analytics, fraud detection, and credit risk management. His findings have been published in well-known international journals and presented at international top conferences. He is also author of the books Analytics in a Big Data World (goo.gl/k3kBrB), and Fraud Analytics using Descriptive, Predictive and Social Network Techniques (http://goo.gl/nlCjUr). His research is summarised at www.dataminingapps.com. He is also teaching the E-learning course, Advanced Analytics in a Big Data World, see http://goo.gl/WibNPF. He also regularly tutors, advises and provides consulting support to international firms with respect to their analytics and credit risk management strategy.

Resources

Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection (Wiley and SAS Business Series). Authors: Bart Baesens, Véronique Van Vlasselaer, Wouter Verbeke.
Series: Wiley and SAS Business Series, Hardcover: 400 pages. Publisher: Wiley; 1 edition,  September 2015. ISBN-10: 1119133122

Fraud Analytics: Using Supervised, Unsupervised and Social Network Learning Techniques. Authors: Bart Baesens, Véronique Van Vlasselaer, Wouter Verbeke
Publisher: Wiley 256 pages
September 2015
ISBN-13: 978-1119133124 | ISBN-10: 1119133122

– Critical Success Factors for Analytical Models: Some Recent Research Insights. Bart Baesens, ODBMS.org,
27 APR, 2015

– Analytics in a Big Data World: The Essential Guide to Data Science and its Applications. Bart Baesens, ODBMS.org, 30 APR, 2014

Related Posts

The threat from AI is real, but everyone has it wrong, Robert Munro, CEO Idibon. ODBMS.org

Follow ODBMS.org on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2015/09/on-fraud-analytics-and-fraud-detection-interview-with-bart-baesens/feed/ 0
Big Data and the Networking industry. Interview with Oskar Mencer http://www.odbms.org/blog/2015/07/big-data-and-the-networking-industry-interview-with-oskar-mencer/ http://www.odbms.org/blog/2015/07/big-data-and-the-networking-industry-interview-with-oskar-mencer/#comments Thu, 23 Jul 2015 07:50:48 +0000 http://www.odbms.org/blog/?p=3943

“Cybersecurity is growing in importance with Obama, Xi and Cameron having announced major efforts to gain better control over the Internet. Dataflow computing enables computation as a bump-in-the-wire without disturbing the flow of packets. Building gateways out of DFEs will significantly support the Cybersecurity agenda for years to come.”–Oskar Mencer

I have interviewed Oskar Mencer, CEO and Founder at Maxeler Technologies. Main topic of the interview is Big Data and the Networking industry, and what Maxeler Technologies is contributing in this market.

RVZ

Q1. What are data flow computers and Dataflow Engines (DFEs)? What are they typically useful for?

Oskar Mencer: Dataflow computers are highly efficient systems for computational problems with large amounts of structured and unstructured data and significant mission-critical computation. DFEs are units within dataflow computers which currently hold up to 96GB of memory and provide on the order of 10K parallel operations. To put that into perspective, for some tasks, a DFE has the equivalent compute capability of a farm of several normal computers, but at a fraction of the price and energy consumption.

Q2. What is data flow analytics, and why is it important?

Oskar Mencer: Dataflow analytics is a software stack on top of Dataflow computers, providing powerful compute constructs on large datasets. Dataflow analytics is a potential answer to the challenges of Big Data and the Internet of Things.

Q3. What is a programmable data plane and how can one create secure storage with it?

Oskar Mencer: Software Defined Networking is all about the programmable control plane. Maxeler’s programmable data plane is the next step in the transformation of the Networking industry.

Q4. What are the main challenges for financial institutions who need to analyze and process massive quantities of information instantly from various sources in order to make better trading decisions?

Oskar Mencer: Today’s financial institutions have a major challenge from new legislation and requirements imposed by governments. Technology can solve some of the issues, but not all of them. On the trading side, whoever manages to process more data and derive more predictive capability from it, has a better position in the marketplace. Trading is becoming more complex and more regulated, and Maxeler’s Technology, in particular as it applies to exchanges, is starting to make a significant difference in the field, helping to push the state-of-the-art while simultaneously making finance safer.

Q5. Juniper Networks announced QFX5100-AA, a new application acceleration switch, and QFX-PFA, a new packet flow accelerator module. How do they plan to use Maxeler Technologies’ dataflow computing?

Oskar Mencer: The Application Acceleration module is based on Maxeler DFEs and programmable with Maxeler dataflow programming tools and infrastructure. The variety of networking applications this enables is tremendous, as is evident from our App gallery, which includes Apps for the Juniper switch.

Q6. What are the advantages of using a Spark/Hadoop appliance using a Juniper switch with programmable data plane?

Oskar Mencer: With a Juniper switch with a programmable dataplane, one could cache answers, compute in the network, optimize and merge maps, and generally make Spark/Hadoop deployment more scalable and more efficient.

Q7. Do you see a convergence of computer, networking and storage via Dataflow Engines (DFEs)?

Oskar Mencer: Indeed, DFEs provide efficiency at the core of networking, storage as well as compute. Dataflow computing has the potential to unify computation, the movement of data and the storage of data into a single system to solve the largest Big data analytics challenges that lie ahead.

Q8. Maxeler has been mentioned in a list of major HPC applications that had an impact on Quality of Life and Prosperity. Could you please explain what is special about this HPC application?

Oskar Mencer: Maxeler solutions provide competitive advantage and help in situations with mission-critical challenges. In 2011, just after the height of the credit crisis, Maxeler won the American Finance Technology Award with JP Morgan for applying dataflow computing to credit derivatives risk computations. Dataflow computing is a good solution for challenges where computing matters.

Q9. Big Data for the Common Good. What is your take on this?

Oskar Mencer: Big Data is a means to an end. Common good arises from bringing more predictability and stability into our lives. For example, many marriages have been saved by the availability of Satnav technology in cars, clearly a Big Data challenge. Medicine is an obvious Big Data challenge. Curing a patient is as much a Big Data challenge as fighting crime, and government in general. I see Maxeler’s dataflow computing technology as a key opportunity to address the Big Data challenges of today and tomorrow.

Qx Anything else you wish to add?

Oskar Mencer: Cybersecurity is growing in importance with Obama, Xi and Cameron having announced major efforts to gain better control over the Internet. Dataflow computing enables computation as a bump-in-the-wire without disturbing the flow of packets. Building gateways out of DFEs will significantly support the Cybersecurity agenda for years to come.

——————-
Oskar Mencer is CEO and Founder at Maxeler Technologies.
Prior to founding Maxeler, Oskar was Member of Technical Staff at the Computing Sciences Center at Bell Labs in Murray Hill, leading the effort in “Stream Computing”. He joined Bell Labs after receiving a PhD from Stanford University. Besides driving Maximum Performance Computing (MPC) at Maxeler, Oskar was Consulting Professor in Geophysics at Stanford University and he is also affiliated with the Computing Department at Imperial College London, having received two Best Paper Awards, an Imperial College Research Excellence Award in 2007 and a Special Award from Com.sult in 2012 for “revolutionising the world of computers”.

——————-

Resources

Programming MPC Systems. White Paper — Maxeler Technologies, ODBMS.org

Related Posts

Streamlining the Big Data Landscape: Real World Network Security Usecase. By Sonali Parthasarathy Accenture Technology Labs. ODBMS.org

Why Data Science Needs Story Telling. By Steve Lohr, technology reporter for the New York Times. ODBMS.org

Pre-emptive Financial Markets Regulation – next step for Big Data. By Morgan Deane, Helvea-Baader Bank Group. ODBMS.org

Data, Process and Scenario Analytics: An Emerging Regulatory Line of Offence. BY Dr. Ramendra K Sahoo, KPMG Financial Risk Management. ODBMS.org

Follow ODBMS.org on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2015/07/big-data-and-the-networking-industry-interview-with-oskar-mencer/feed/ 0
On Hadoop and Big Data. Interview with John Leach http://www.odbms.org/blog/2015/07/on-hadoop-and-big-data-interview-with-john-leach/ http://www.odbms.org/blog/2015/07/on-hadoop-and-big-data-interview-with-john-leach/#comments Mon, 13 Jul 2015 08:32:52 +0000 http://www.odbms.org/blog/?p=3941

“One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration.”– John Leach

I have interviewed John Leach, CTO & Cofounder Splice Machine.  Main topics of the interview are Hadoop, Big Data integration and what Splice Machine has to offer in this space.  Monte Zweben, CEO of Splice Machine also contributed to the interview.

RVZ

Q1. What are the Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation?

John Leach, Monte Zweben:
1. Individual record lookups. Most SQL-on-Hadoop engines are designed for full table scans in analytics, but tend to be too slow for the individual record lookups and range scans used by operational applications.
2. Dirty Data. Dirty data is a problem for any system, but it is compounded in Big Data, often resulting in bad reports and delays to reload an entire data set.
3. Sharding. It can be difficult to know which key to distribute data on and what the right shard size is. Poor choices result in slow queries, especially for large joins or aggregations.
4. Hotspotting. This happens when data becomes too concentrated in a few nodes, especially for time series data. The impact is slow queries and poor parallelization.
5. SQL coverage. Limited SQL dialects will make it so you can’t run queries to meet business needs. You’ll want to make sure you do your homework. Compile the list of toughest queries and test.
6. Concurrency. Low concurrency can result in the inability to power real-time apps, handle many users, support many input sources, and deliver reports as updates happen.
7. Columnar. Not all columnar solutions are created equal. Besides columnar storage, there are many other optimizations, such as vectorization and run-length encoding, that can have a big impact on analytic performance. If your OLAP queries run slower, common with large joins and aggregations, this will result in poor productivity. Queries may take minutes or hours instead of seconds. The flip side is the mistake of using columnar storage when you need concurrency and real-time operation.
8. Node Sizing. Do your homework and profile your workload. Choosing the wrong node size (e.g., CPU cores, memory) can negatively impact price/performance and create performance bottlenecks.
9. Brittle ETL on Hadoop. With many SQL-on-Hadoop solutions being unable to provide update or delete capabilities without a full data reload, this can cause a very brittle ETL that will require restarting your ETL pipeline because of errors or data quality issues. The result is a missed ETL window and delayed reports to business users.
10. Cost-Based Optimizer. A cost-based optimizer improves performance by selecting the right join strategy, the right index, and the right ordering. Some SQL-on-Hadoop engines have no cost-based optimizer or relatively immature ones that can result in poor performance and poor productivity, as well as manual tuning by DBAs.
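
To make pitfall 4 concrete: with time series data, a row key that starts with the timestamp sends all new writes to the same shard. One widely used mitigation, which the interview itself does not describe, is to prefix ("salt") the key with a bucket number. The sketch below is hypothetical throughout: `salted_key`, the key layout and `NUM_BUCKETS` are illustrative and not part of any product's API.

```python
import hashlib

# Hypothetical bucket count; in practice it is sized to the cluster.
NUM_BUCKETS = 8


def salted_key(device_id: str, timestamp_ms: int,
               num_buckets: int = NUM_BUCKETS) -> str:
    """Build a row key of the form '<bucket>|<device>|<timestamp>'.

    The bucket comes from a hash of the device id, so writes from many
    devices spread over all buckets while one device's rows stay
    contiguous and sorted by time."""
    digest = hashlib.md5(device_id.encode("utf-8")).digest()
    bucket = digest[0] % num_buckets
    return f"{bucket:02d}|{device_id}|{timestamp_ms:013d}"


first = salted_key("meter-42", 1436775172000)
second = salted_key("meter-42", 1436775173000)
```

The trade-off is that a scan across all devices for one time window now has to touch every bucket, so the bucket count should stay modest.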

Q2. In your experience, what are the most common problems in Big Data integration?

John Leach, Monte Zweben: Providing users access to data in a fashion they can understand and at the moment they need it, while ensuring quality and security, can be incredibly challenging.

The volume and velocity of data that businesses are churning out, along with the variety of different sources, can pose many issues.

One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration. Businesses may also find their skill sets, workload, and budgets over-stretched by the need to manage terabytes or petabytes of structured and unstructured data in a way that delivers genuine value to business users.

When data is siloed and there is no solution put into place, businesses can’t access the real-time insights they need to make the best decisions for their business. Performance goes down, headaches abound and cost goes way up, all in the effort to manage the data. That’s why a Big Data integration solution is a prerequisite for getting the best performance and the most real-time insights, at the lowest cost.

Q3. What are the capabilities of Hadoop beyond data storage?

John Leach, Monte Zweben: Hadoop has a very broad range of capabilities and tools:

Oozie for workflow
Pig for scripting
Mahout or SparkML for machine learning
Kafka and Storm for streaming
Flume and Sqoop for integration
Hive, Impala, Spark, and Drill for SQL analytic querying
HBase for NoSQL
Splice Machine for operational, transactional RDBMS

Q4. What programming skills are required to handle application development around Big Data platforms like Hadoop?

John Leach, Monte Zweben: To handle application development on Hadoop, individuals can choose between raw Hadoop and SQL-on-Hadoop. When going the SQL route, very few new skills are required and developers can open connections to an RDBMS on Hadoop just like they used to do on Oracle, DB2, SQL Server, or Teradata. Raw Hadoop application developers should know their way around the core components of the Hadoop stack, such as HDFS, MapReduce, Kafka, Storm, Oozie, Hive, Pig, HBase, and YARN. They should also be proficient in Java.

Q5. What are the current challenges for real-time application deployment on Hadoop?

John Leach, Monte Zweben: When we talk about real-time at Splice Machine, we’re focused on applications that require not only real-time responses to queries, but also real-time database updates from a variety of data sources. The former is not all that uncommon on Hadoop; the latter is nearly impossible for most Hadoop-based systems.

Deploying real-time applications on Hadoop is really a function of moving Hadoop beyond its batch processing roots to be able to handle real-time database updates with high concurrency and transactional integrity. We harness HBase along with a lockless snapshot isolation design to provide full ACID transactions across rows and tables.

This technology enables Splice Machine to execute the high concurrency of transactions required by real-time applications.

Q6. What is special about Splice Machine auto-sharding replication and failover technology?

John Leach, Monte Zweben: As part of its auto-sharding, HBase horizontally partitions or splits each table into smaller chunks or shards that are distributed across multiple servers. Using the inherent failover and replication capabilities of HBase and Hadoop, Splice Machine can support applications that demand high availability.

HBase co-processors are used to embed Splice Machine in each distributed HBase region (i.e., data shard). This enables Splice Machine to achieve massive parallelization by pushing the computation down to each distributed data shard without any overhead of MapReduce.

Q7. How difficult is it for customers to migrate from legacy databases to Splice Machine?

John Leach, Monte Zweben: Splice Machine offers a variety of services to help businesses efficiently deploy the Splice Machine database and derive maximum value from their investment. These services include both implementation consulting and educational offerings delivered by our expert team.

Splice Machine has designed a Safe Journey program to significantly ease the effort and risk for companies migrating to a Splice Machine database. The Safe Journey program includes a proven methodology that helps choose the right workloads to migrate, implements risk-mitigation best practices, and includes commercial tools that automate most of the PL/SQL conversion process.

This is not to suggest that all legacy databases will convert to a Hadoop RDBMS.
The best candidates will typically have over 1TB of data, which often leads to cost and scaling issues in legacy databases.

Q8. You have recently announced partnership with Talend, mrc (michaels, ross & cole ltd.) and RedPoint Global. Why Talend, mrc, and RedPoint Global? What is the strategic meaning of these partnerships for Splice Machine?

John Leach, Monte Zweben: Our uptick in recent partnerships demonstrates the tremendous progress our team has made over the past year. We have been working relentlessly to develop the Splice Machine Hadoop RDBMS into a fully enterprise-ready database that can replace legacy database systems.

The demand for programming talent to handle application development is growing faster than the supply of skilled talent, especially around newer platforms like Hadoop. We partnered with mrc to give businesses a solution that can speed real-time application deployment on Hadoop with the staff and tools they currently have, while also offering future-proof applications over a database that scales to meet increasing data demands.

We partnered with Talend to bring our customers the benefit of two different approaches for managing data integration affordably and at scale. Talend’s rich capabilities, including a drag-and-drop user interface and an adaptable platform, allow for increased productivity and streamlined testing for faster deployment of web, mobile, OLTP or Internet of Things applications.

And finally, we integrated and certified our Hadoop RDBMS on RedPoint’s Convergent Marketing Platform™ to create a new breed of solution for marketers. With cost-efficient database scale-out and real-time cross-channel execution, the solution enables enterprises to future-proof their marketing technology investment through affordable access to all their data (social, mobile, click streams, website behaviors, etc.) across a proliferating and ever-changing list of channels. Furthermore, it complements any existing Hadoop deployment, including those on the Cloudera, MapR and Hortonworks distributions.

Q9. How is Splice Machine working with Hadoop distribution partners –such as MapR, Hortonworks and Cloudera?

John Leach, Monte Zweben: Since Splice Machine does not modify HBase, it can be used with any standard Hadoop distribution that includes HBase, including Cloudera, MapR and Hortonworks. Splice Machine enables enterprises using these three companies to tap into real-time updates with transactional integrity, an important feature for companies looking to become real-time, data-driven businesses.

In 2013, Splice Machine partnered with MapR to enable companies to use the MapR distribution for Hadoop to build their real time, SQL-on-Hadoop applications. In 2014, we joined the Cloudera Connect Partner Program, after certifying on CDH 5. We are working closely with Cloudera to maximize the potential of its full suite of Hadoop-powered software and our unique approach to real-time Hadoop.

That same year, we joined Hortonworks Technology Partner program. This enabled our users to harness innovations in management, provisioning and security for HDP deployments. For HDP users, Splice Machine enables them to build applications that use ANSI-standard SQL and support real-time updates with transactional integrity, allowing Hadoop to be used in both OLTP and OLAP applications.

Earlier this year, we were excited to achieve Hortonworks® Data Platform (HDP™) Certification. With the HDP certification, our customers can leverage the pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform, the industry’s only 100-percent open source Hadoop distribution, to simplify and accelerate their Splice Machine and Hadoop deployments.

Q10. What are the challenges of running online transaction processing on Hadoop?

John Leach, Monte Zweben: With its heritage as a batch processing system, Hadoop does not provide the transaction support required by online transaction processing. Transaction support can be tricky enough to implement for shared-disk RDBMSs such as Oracle, but it becomes far more difficult to implement in distributed environments such as Hadoop. A distributed transactional model requires high levels of coordination across a cluster without too much overhead, while simultaneously providing high performance for a high concurrency of small reads and writes, high-speed ingest, and massive bulk loads. We prove this by being able to run the TPC-C benchmark at scale.

Splice Machine met those requirements by using distributed snapshot isolation, a Multi-Version Concurrency Control model that delivers lockless, high-concurrency transactional support. Splice Machine extended research from Google’s Percolator project, Yahoo Lab’s OMID project, and the University of Waterloo’s HBaseSI project to develop its own patent-pending, distributed transactions.
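
For readers unfamiliar with the idea, the toy below illustrates snapshot isolation with multi-version concurrency control in a single process. It is not Splice Machine's design and leaves out everything distributed. The class names and the first-committer-wins conflict rule are assumptions made for this sketch.

```python
class WriteConflict(Exception):
    """Raised when another transaction committed the same key first."""


class Store:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0      # logical timestamp source

    def tick(self):
        self.clock += 1
        return self.clock

    def begin(self):
        return Transaction(self, self.tick())


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # buffered until commit; no locks are taken

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        # Newest version committed at or before this transaction's snapshot.
        for commit_ts, value in reversed(self.store.versions.get(key, [])):
            if commit_ts <= self.start_ts:
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First committer wins: abort if any written key has a version
        # newer than our snapshot.
        for key in self.writes:
            history = self.store.versions.get(key, [])
            if history and history[-1][0] > self.start_ts:
                raise WriteConflict(key)
        commit_ts = self.store.tick()
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts
```

Readers never block writers here, because each transaction only sees versions committed before it started. Conflicts surface at commit time rather than through locks.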

 

———————-
John Leach – CTO & Cofounder Splice Machine
With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies.
Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning.
John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach currently is the organizer for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.

Monte Zweben – CEO & Cofounder Splice Machine
A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program.
Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.

In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA.
Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.

Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.

 

Resources

– Splice Machine resource page, ODBMS.org

Related Posts

– Common misconceptions about SQL on Hadoop. By Cynthia M. Saracco, ODBMS.org, July 2015

– SQL over Hadoop: Performance isn’t everything… By Simon Harris, ODBMS.org, March 2015

– Archiving Everything with Hadoop. By Mark Cusack, ODBMS.org, December 2014

– On Hadoop RDBMS. Interview with Monte Zweben, ODBMS Industry Watch, November 2, 2014

– AsterixDB: Better than Hadoop? Interview with Mike Carey, ODBMS Industry Watch, October 22, 2014

 


 

]]>
http://www.odbms.org/blog/2015/07/on-hadoop-and-big-data-interview-with-john-leach/feed/ 0