ODBMS Industry Watch » Analytics. Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation. http://www.odbms.org/blog

On the InterSystems IRIS Data Platform. Published Fri, 09 Feb 2018. http://www.odbms.org/blog/2018/02/on-the-intersystems-iris-data-platform/

“We believe that businesses today are looking for ways to leverage the large amounts of data collected, which is driving them to try to minimize, or eliminate, the delay between event, insight, and action to embed data-driven intelligence into their real-time business processes.” –Simon Player

I have interviewed Simon Player, Director of Development for TrakCare and Data Platforms; Helene Lengler, Regional Director for DACH & BeNeLux; and Joe Lichtenberg, Director of Marketing for Data Platforms. All three work at InterSystems. We talked about the new InterSystems IRIS Data Platform.

RVZ

Q1. You recently announced the InterSystems IRIS Data Platform®. What is it?

Simon Player: We believe that businesses today are looking for ways to leverage the large amounts of data collected, which is driving them to try to minimize, or eliminate, the delay between event, insight, and action to embed data-driven intelligence into their real-time business processes.

It is time for database software to evolve and offer multiple capabilities to manage that business data within a single, integrated software solution. This is why we chose to include the term ‘data platform’ in the product’s name.
InterSystems IRIS Data Platform supports transactional and analytic workloads concurrently, in the same engine, without requiring moving, mapping, or translating the data, eliminating latency and complexity. It incorporates multiple, disparate and dissimilar data sources, supports embedded real-time analytics, easily scales for growing data and user volumes, interoperates seamlessly with other systems, and provides flexible, agile, DevOps-compatible deployment capabilities.

InterSystems IRIS provides concurrent transactional and analytic processing capabilities; support for multiple, fully synchronized data models (relational, hierarchical, object, and document); a complete interoperability platform for integrating disparate data silos and applications; and sophisticated structured and unstructured analytics capabilities supporting both batch and real-time use cases in a single product built from the ground up with a single architecture. The platform also provides an open analytics environment for incorporating best-of-breed analytics into InterSystems IRIS solutions, and offers flexible deployment capabilities to support any combination of cloud and on-premises deployments.

Q2. How is InterSystems IRIS Data Platform positioned with respect to other Big Data platforms in the market (e.g. Amazon Web Services, Cloudera, Hortonworks Data Platform, Google Cloud Platform, IBM Watson Data Platform and Watson Analytics, Oracle Data Cloud system, Microsoft Azure, to name a few) ?

Joe Lichtenberg: Unlike other approaches that require organizations to implement and integrate different technologies, InterSystems IRIS delivers all of the functionality in a single product with a common architecture and development experience, making it faster and easier to build real-time, data rich applications. However it is an open environment and can integrate with existing technologies already in use in the customer’s environment.

Q3. How do you ensure High Performance with Horizontal and Vertical Scalability? 

Simon Player: Scaling a system vertically by increasing its capacity and resources is a common, well-understood practice. Recognizing this, InterSystems IRIS includes a number of built-in capabilities that help developers leverage the gains and optimize performance. The main areas of focus are Memory, IOPS and Processing management. Some of these tuning mechanisms operate transparently, while others require specific adjustments on the developer’s own part to take full advantage.
One example of those capabilities is parallel query execution. Built on a flexible infrastructure for maximizing CPU usage, it spawns one process per CPU core and is most effective with large data volumes, such as analytical workloads that perform large aggregations.

When vertical scaling does not provide the complete solution—for example, when you hit the inevitable hardware (or budget) ceiling—data platforms can also be scaled horizontally. Horizontal scaling fits very well with virtual and cloud infrastructure, in which additional nodes can be quickly and easily provisioned as the workload grows, and decommissioned if the load decreases.
InterSystems IRIS accomplishes this by providing the ability to scale for both increasing user volume and increasing data volume.

For increased user capacity, we leverage a distributed cache with an architectural solution that partitions users transparently across a tier of application servers sitting in front of our data server(s). Each application server handles user queries and transactions using its own cache, while all data is stored on the data server(s), which automatically keeps the application server caches in sync.
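The application-server tier described above can be illustrated with a small sketch. It is not InterSystems code: the class names, the modulo-based user routing, and the invalidate-on-write strategy are all assumptions chosen for illustration, since the interview does not say how the caches are actually kept in sync.

```python
class DataServer:
    """Holds the authoritative copy of all data and tracks attached application servers."""

    def __init__(self):
        self._store = {}
        self._app_servers = []

    def attach(self, app_server):
        self._app_servers.append(app_server)

    def read(self, key):
        return self._store.get(key)

    def write(self, key, value, origin):
        # Persist the value, then invalidate stale copies held by the other application servers.
        # (Invalidation is an assumed sync strategy, not the documented IRIS mechanism.)
        self._store[key] = value
        for server in self._app_servers:
            if server is not origin:
                server.invalidate(key)


class AppServer:
    """Serves a subset of users from a local cache backed by the data server."""

    def __init__(self, data_server):
        self._data_server = data_server
        self._cache = {}
        self.misses = 0
        data_server.attach(self)

    def get(self, key):
        if key not in self._cache:
            # Cache miss: fetch from the data server and keep a local copy.
            self.misses += 1
            self._cache[key] = self._data_server.read(key)
        return self._cache[key]

    def put(self, key, value):
        # Write through to the data server, then refresh the local copy.
        self._data_server.write(key, value, origin=self)
        self._cache[key] = value

    def invalidate(self, key):
        self._cache.pop(key, None)


def route(user_id, app_servers):
    """Assign a user to an application server (hypothetical modulo partitioning)."""
    return app_servers[user_id % len(app_servers)]
```

In this sketch a write through one application server removes the stale entry from the others, so their next read goes back to the data server; repeated reads in between are served locally.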

For increased data volume, we distribute the workload to a sharded cluster with partitioned data storage, along with the corresponding caches, providing horizontal scaling for queries and data ingestion. In a basic sharded cluster, a sharded table is partitioned horizontally into roughly equal sets of rows called shards, which are distributed across a number of shard data servers. For example, if a table with 100 million rows is partitioned across four shard data servers, each stores a shard containing about 25 million rows. Queries against a sharded table are decomposed into multiple shard-local queries to be run in parallel on multiple servers; the results are then transparently combined and returned to the user. This distributed data layout can further be exploited for parallel data loading and with third party frameworks like Apache Spark.
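The decomposition of a query into shard-local queries can be sketched as follows. This is a scaled-down illustration under stated assumptions, not the IRIS implementation: rows are assigned to shards by row id modulo the shard count, threads stand in for separate shard data servers, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, shard_count):
    """Split rows into roughly equal shards using the row id modulo the shard count."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[row["id"] % shard_count].append(row)
    return shards


def shard_local_query(shard):
    """Shard-local part of SELECT COUNT(*), SUM(amount): returns a partial aggregate."""
    return len(shard), sum(row["amount"] for row in shard)


def sharded_query(shards):
    """Run the shard-local queries in parallel and combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(shard_local_query, shards))
    total_count = sum(count for count, _ in partials)
    total_amount = sum(amount for _, amount in partials)
    return total_count, total_amount


# 100 rows stand in for the 100 million rows of the example above.
rows = [{"id": i, "amount": i * 2} for i in range(100)]
shards = partition(rows, shard_count=4)
count, amount = sharded_query(shards)
```

Counts and sums combine by simple addition; an average would have to be derived from the combined count and sum rather than by averaging the per-shard averages.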

Horizontal clusters require greater attention to the networking component to ensure that it provides sufficient bandwidth for the multiple systems involved and is entirely transparent to the user and the application.

Q4. How can you simultaneously process both transactional and analytic workloads in a single database?

Simon Player: At the core of InterSystems IRIS is a proven, enterprise-grade, distributed, hybrid transactional-analytic processing (HTAP) database. It can ingest and store transactional data at very high rates while simultaneously processing high volumes of analytic workloads on real-time data (including ACID-compliant transactional data) and non-real-time data. This architecture eliminates the delays associated with moving real-time data to a different environment for analytic processing. InterSystems IRIS is built on a distributed architecture to support large data volumes, enabling organizations to analyze very large data sets while simultaneously processing large amounts of real-time transactional data.

Q5. There is a wide range of analytics, including business intelligence, predictive analytics, distributed big data processing, real-time analytics, and machine learning. How do you support them in the InterSystems IRIS Data Platform?

Simon Player: Many of these capabilities are built into the platform itself and leverage that tight integration to simultaneously process both transactional and analytic workloads; however, we realize that there are multiple use cases where customers and partners would like InterSystems IRIS Data Platform to access data on other systems or to build solutions that leverage best-of-breed tools (such as ML algorithms, Spark etc.) to complement our platform and quickly access data stored on it.
That’s why we chose to provide open analytics capabilities supporting industry standard APIs such as UIMA, Java Integration, xDBC and other connectivity options.

Q6. What about third-party analytics tools? 

Simon Player:  The InterSystems IRIS Data Platform offers embedded analytics capabilities such as business intelligence, distributed big data processing & natural language processing, which can handle both structured and unstructured data with ease. It is designed as an Open Analytics Platform, built around a universal, high-performance and highly scalable data store.
Third-party analytics tools can access data stored on the platform via standard APIs including ODBC, JDBC, .NET, SOAP, REST, and the new Apache Spark Connector. In addition, the platform supports working with industry-standard analytical artifacts such as predictive models expressed in PMML and unstructured data processing components adhering to the UIMA standard.

Q7. How does InterSystems IRIS Data Platform integrate into existing infrastructures and with existing best-of-breed technologies (including your own products)?

Simon Player:  InterSystems IRIS offers a powerful, flexible integration technology that enables you to eliminate “siloed” data by connecting people, processes, and applications. It includes the comprehensive range of technologies needed for any connectivity task.
InterSystems IRIS can connect to your existing data and applications, enabling you to leverage your investment, rather than “ripping and replacing.” With its flexible connectivity capabilities, solutions based on InterSystems IRIS can easily be deployed in any client environment.

Built-in support for standard APIs enables solutions based on InterSystems IRIS to leverage applications that use Java, .NET, JavaScript, and many other languages. Support for popular data formats, including JSON, XML, and more, cuts down time to connect to other systems.

A comprehensive library of adapters provides out-of-the-box connectivity and data transformations for packaged applications, databases, industry standards, protocols, and technologies – including SQL, SOAP, REST, HTTP, FTP, SAP, TCP, LDAP, Pipe, Telnet, and Email.

Object inheritance minimizes the effort required to build any needed custom adapters. Using InterSystems IRIS’ unit testing service, custom adapters can be tested without first having to complete the entire solution. Traceability of each event allows efficient analysis and debugging.

The InterSystems IRIS messaging engine offers guaranteed message delivery, content-based routing, high-performance message transformation, and support for both synchronous and asynchronous interactions. InterSystems IRIS has a graphical editor for business process orchestration, a business rules engine, and a workflow editor that enable you to automate your enterprise-wide business procedures or create new composite applications. With world-class support for XML, SOAP, JSON and REST, InterSystems IRIS is ideal for creating an Enterprise Service Bus (ESB) or employing a Service-Oriented Architecture (SOA).

Because it includes a high performance transactional-analytic database, InterSystems IRIS can store and analyze messages as they flow through your system. It enables business activity monitoring, alerting, real-time business intelligence, and event processing.

· Other integration points with industry standards or best-of-breed technologies include the ability to easily transport files between client machines and the server in a secure manner via our Managed File Transfer (MFT) capability. This functionality leverages state-of-the-art MFT providers like Box, Dropbox and KiteWorks to provide a simple client that non-technical users can install and companies can pre-configure and brand. InterSystems IRIS connects with these providers as a peer and exposes common APIs (e.g. to manage users).

· When using Apache Spark for large distributed data processing and analytics tasks, the Spark Connector will leverage the distributed data layout of sharded tables and push computation as close to the data as possible, increasing parallelism and thus overall throughput significantly vs regular JDBC connections.

Q8. What market segments do you address with IRIS  Data Platform?

Helene Lengler: InterSystems IRIS is an open platform that suits virtually any industry, but we will be initially focusing on a couple of core market segments, primarily due to varying regional demand. For instance, we will concentrate on the financial services industry in the US or UK and the retail and logistics market in the DACH and Benelux regions. Additionally, in Germany and Japan, our major focus will be on the manufacturing industry, where we see a rapidly growing demand for data-driven solutions, especially in the areas of predictive maintenance and predictive analytics.
We are convinced that InterSystems IRIS is ideal for this and also for other kinds of IoT applications with its ability to handle large-scale transactional and analytic workloads. On top of this, we are also looking to engage with companies that are at the very beginning of product development – in other words, start-ups and innovators working on solutions that require a robust, future-proof data platform.

Q9. Are there any proofs of concept available?

Helene Lengler: Yes. Although the solution has only been available to selected partners for a couple of weeks, we have already completed the first successful migration in Germany. A partner offering an Enterprise Information Management System, which allows organizations to archive and access all of their data, documents, emails and paper files, has been able to migrate from InterSystems Caché to InterSystems IRIS in as little as a couple of hours and – most importantly – without any issues at all. The partner decided to move to InterSystems IRIS because they are in the process of signing a contract with one of the biggest players in the German travel & transport industry. With customers like this, you are looking at data volumes in the petabyte range very, very shortly, meaning you require the right technology from the start in order to be able to scale horizontally – using InterSystems IRIS technologies such as sharding – as well as vertically.

In addition, we were able to show a live IoT demonstrator at our InterSystems DACH Symposium in November 2017. This proof of concept is actually a lighthouse example of what the new platform brings to the table: a team of three different business partners and InterSystems experts leveraged InterSystems IRIS’ capabilities to rapidly develop and implement a fully functional solution for a predictive maintenance scenario. Numerous other test scenarios and PoCs are currently being conducted in various industry segments with different partners around the globe.

Q10. Can developers already use InterSystems IRIS Data Platform? 

Simon Player: Yes. Starting on 1/31, developers can use our sandbox, the InterSystems IRIS Experience, at www.intersystems.com/experience.

Qx. Anything else you wish to add?

Simon Player: The public is welcome to join the discussion on how to graduate from database to data platform on our developer community at https://community.intersystems.com.

——————————–
Simon Player is director of development for both TrakCare and Data Platforms at InterSystems. Simon has used and developed on InterSystems technologies since the early 1990s. He holds a BSc in Computer Sciences from the University of Manchester.


Helene Lengler is the Regional Managing Director for the DACH and Benelux regions. She joined InterSystems in July 2016 and has more than 25 years of experience in the software technology industry. During her professional career, she has held various senior positions at Oracle, including Vice President (VP) Sales Fusion Middleware and member of the executive board at Oracle Germany, VP Enterprise Sales and VP of Oracle Direct. Prior to her 16 years at Oracle, she worked for the Digital Equipment Corporation in several business disciplines such as sales, marketing and presales.
Helene holds a Master’s degree from the Julius-Maximilians-University in Würzburg and a post-graduate Business Administration degree from AKAD in Pinneberg.

Joe Lichtenberg is responsible for product and industry marketing for data platform software at InterSystems. Joe has decades of experience working with various data management, analytics, and cloud computing technology providers.

Resources

InterSystems IRIS Data Platform, Product Page.

E-Book (IDC): Slow Data Kills Business.

White Paper (ESG): Building Smarter, Faster, and Scalable Data-rich Applications for Businesses that Operate in Real Time. 

Achieving Horizontal Scalability, Alain Houf – Sales Engineer, InterSystems

Horizontal Scalability with InterSystems IRIS

Press release: InterSystems IRIS Data Platform™ Now Available.

Related Posts

Facing the Challenges of Real-Time Analytics. Interview with David Flower. Source: ODBMS Industry Watch, Published on 2017-12-19

On the future of Data Warehousing. Interview with Jacque Istok and Mike Waas. Source: ODBMS Industry Watch, Published on 2017-11-09

On Vertica and the new combined Micro Focus company. Interview with Colin Mahony. Source: ODBMS Industry Watch, Published on 2017-10-25

On Open Source Databases. Interview with Peter Zaitsev. Source: ODBMS Industry Watch, Published on 2017-09-06

Follow us on Twitter: @odbmsorg

##
Facing the Challenges of Real-Time Analytics. Interview with David Flower. Published Tue, 19 Dec 2017. http://www.odbms.org/blog/2017/12/facing-the-challenges-of-real-time-analytics-interview-with-david-flower/

“We are now seeing a number of our customers in financial services adopt a real-time approach to detecting and preventing fraudulent credit card transactions. With the use of ML integrating into the real-time rules engine within VoltDB, the transaction can be monitored, validated and either rejected or passed, before being completed, saving time and money for both the financial institution and the consumer.”–David Flower.

I have interviewed David Flower, President and Chief Executive Officer of VoltDB. We discussed his strategy for VoltDB,  and the main data challenges enterprises face nowadays in performing real-time analytics.

RVZ

Q1. You joined VoltDB as Chief Revenue Officer last year, and since March 29, 2017 you have been appointed to the role of President and Chief Executive Officer. What is your strategy for VoltDB?

David Flower : When I joined the company we took a step back to really understand our business and move from the start-up phase to growth stage. As with all organizations, you learn from what you have achieved but you also have to be honest about what your value is. We looked at three fundamentals:
1) Success in our customer base – industries, use cases, geography
2) Market dynamics
3) Core product DNA – the underlying strengths of our solution, over and above any other product in the market

The outcome of this exercise is we have moved from a generic veneer market approach to a highly focused specialized business with deep domain knowledge. As with any business, you are looking for repeatability into clearly defined and understood market sectors, and this is the natural next phase in our business evolution and I am very pleased to report that we have made significant progress to date.

With the growing demand for massive data management aligned with real-time decision making, VoltDB is well positioned to take advantage of this opportunity.

Q2. VoltDB is not the only in-memory transactional database in the market. What is your unique selling proposition and how do you position VoltDB in the broader database market?

David Flower : The advantage of operating in the database market is the pure size and scale that it offers – and that is also the disadvantage. You have to be able to express your target value. Through our customers and the strategic review we undertook, we are now able to express more clearly what value we have and where, and equally importantly, where we do not play! Our USPs revolve around our product principles – vast data ingestion scale, full ACID consistency and the ability to undertake real-time decisioning, all supported through a distributed low-latency in-memory architecture. We also embrace traditional RDBMS through SQL to leverage existing market skills and reduce the associated cost of change. We offer a proven enterprise-grade database that is used by some of the world’s leading and most demanding brands, a claim that many other companies in our market are unable to make.

Q3. VoltDB was founded in 2009 by a team of database experts, including Dr. Michael Stonebraker (winner of the ACM Turing award). How many of Stonebraker’s ideas are still in VoltDB and what is new?

David Flower : We are both proud and privileged to be associated with Dr. Stonebraker, and his stature in the database arena is without comparison. Mike’s original ideas underpin our product philosophy and our future direction, and he continues to be actively engaged in the business and will always remain a fundamental part of our heritage. Through our internal engineering experts and in conjunction with our customers, we have developed on Mike’s original ideas to bring additional features, functions and enterprise grade capabilities into the product.

Q4. Stonebraker co-founded several other database companies. Before VoltDB, in 2005, Stonebraker co-founded Vertica to commercialize the technology behind C-Store; and after VoltDB, in 2013 he co-founded another company called Tamr. Is there any relationship between Vertica, VoltDB and Tamr?

David Flower : Mike’s legacy in this field speaks for itself. VoltDB evolved from the Vertica business and while we have no formal ties, we are actively engaged with numerous leading technology companies that enable clients to gain deeper value through close integrations.

Q5. VoltDB is a ground-up redesign of a relational database. What are the main data challenges enterprises face nowadays in performing real-time analytics?

David Flower : The demand for ‘real-time’ is one of the most challenging areas for many businesses today. Firstly, the definition of real-time is changing. Batch or micro-batch processing is now unacceptable, whether for the consumer, the customer or, in some cases, compliance. Secondly, analytics is also moving from the back-end (post event) to the front-end (in-event or in-process).
The drivers around AI and ML are forcing this even more. The market requirement is now for real-time analytics but what is the value of this if you cannot act on it? This is where VoltDB excels – we enable the action on this data, in process, and when the data/time is most valuable. VoltDB is able to truly deliver on the value of translytics – the combination of real-time transactions with real-time analytics, and we can demonstrate this through real use cases.

Q6. VoltDB is specialized in high-velocity applications that thrive on fast streaming data. What is fast streaming data and why does it matter?

David Flower : As previously mentioned, VoltDB is designed for high volume data streams that require a decision to be taken ‘in-stream’ while the data remains consistent at all times. Fast streaming data is best defined through real applications – policy management, authentication and billing, as examples in telecoms; fraud detection & prevention in finance (such as massive credit card processing streams); customer engagement offerings in media & gaming; and areas such as smart-metering in IoT.
The underlying principle is that the window of opportunity (action) is available while the fast data stream is being processed, and once it has passed the value of the opportunity diminishes.

Q7. You have recently announced an “Enterprise Lab Program” to accelerate the impact of real-time data analysis at large enterprise organizations. What is it and how does it work?

David Flower : The objective of the Enterprise Lab Program is to enable organizations to access, test and evaluate our enterprise solution within their own environment and determine the applicability of VoltDB for either the modernization of existing applications or for the support of next gen applications. This comes without restriction, and provides full access to our support, technical consultants and engineering resources. We realize that selecting a database is a major decision and we want to ensure the potential of our product can be fully understood, tested and piloted with access to all our core assets.

Q8. You have been quoted saying that “Fraud is a huge problem on the Internet, and is one of the most scalable cybercrimes on the web today. The only way to negate the impact of fraud is to catch it before a transaction is processed”. Is this really always possible? How do you detect a fraud in practice?

David Flower : With the phenomenal growth in e-commerce and the changing consumer demands for web-driven retailing, the concerns relating to fraud (credit card) are only going to increase. The internet creates the challenge of handling massive transaction volumes, and cyber criminals are becoming ever more sophisticated in their approach.
Traditional fraud models simply were not designed to manage at this scale, and in many cases post-transaction capture is too late – the damage has been done. We are now seeing a number of our customers in financial services adopt a real-time approach to detecting and preventing fraudulent credit card transactions. With the use of ML integrating into the real-time rules engine within VoltDB, the transaction can be monitored, validated and either rejected or passed, before being completed, saving time and money for both the financial institution and the consumer. By using the combination of post-analytics and ML, the most relevant, current and effective set of rules can be applied as the transaction is processed.
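The idea of validating a transaction before it is completed can be sketched in a few lines. This is a hypothetical illustration, not VoltDB’s rules engine or API: the `FraudGate` class, the amount limit, and the sliding-window velocity rule are all assumptions, and the fixed thresholds stand in for rules that an ML model would keep up to date.

```python
from collections import defaultdict, deque


class FraudGate:
    """Evaluates rules against a transaction before it is committed (illustrative only)."""

    def __init__(self, max_amount, max_per_window, window_seconds):
        self.max_amount = max_amount
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # card -> timestamps of accepted transactions
        self.ledger = []  # committed transactions

    def _velocity_ok(self, card, timestamp):
        recent = self._recent[card]
        # Drop timestamps that have fallen out of the sliding window.
        while recent and timestamp - recent[0] >= self.window_seconds:
            recent.popleft()
        return len(recent) < self.max_per_window

    def process(self, card, amount, timestamp):
        """Commit and return True if every rule passes; otherwise reject without committing."""
        if amount > self.max_amount:
            return False
        if not self._velocity_ok(card, timestamp):
            return False
        self._recent[card].append(timestamp)
        self.ledger.append((card, amount, timestamp))
        return True
```

Because the checks run inside `process`, a rejected transaction never reaches the ledger, which is the difference from post-transaction detection described above.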

Q9. Another area where VoltDB is used is in mobile gaming. What are the main data challenges with mobile gaming platforms?

David Flower : Mobile gaming is a perfect example of fast data – large data streams that require real-time decisioning for in-game customer engagement. The consumer wants the personal interaction but with relevant offers at that precise moment in the game. VoltDB is able to support this demand, at scale and based on the individual’s profile and stage in the application/game. The concept of the right offer, to the right person, at the right time ensures that the user remains loyal to the game and the game developer (company) can maximize its revenue potential through high customer satisfaction levels.

Q11. Can you explain the purpose of VoltDB’s recently announced co-operations with Huawei and Nokia?

David Flower : We have developed close OEM relationships with a number of major global clients, of which Huawei and Nokia are representative. Our aim is to be more than a traditional vendor, and bring additional value to the table, be it in the form of technical innovation, through advanced application development, or in terms of our ‘total company’ support philosophy. We also recognize that infrastructure decisions are critical by nature, and are not made for the short-term.
VoltDB has been rigorously tested by both Huawei and Nokia and was selected for several reasons against some of the world’s leading technologies, but fundamentally because our product works – and works in the most demanding environments providing the capability for existing and next-generation enterprise grade applications.

—————

David Flower brings more than 28 years of experience within the IT industry to the role of President and CEO of VoltDB. David has a track record of building significant shareholder value across multiple software sectors on a global scale through the development and execution of focused strategic plans, organizational development and product leadership.

Before joining VoltDB, David served as Vice President EMEA for Carbon Black Inc. Prior to Carbon Black he held senior executive positions in numerous successful software companies including Senior Vice President International for Everbridge (NASDAQ: EVBG); Vice President EMEA (APM division) for Compuware (formerly NASDAQ: CPWR); and UK Managing Director and Vice President EMEA for Gomez. David also held the position of Group Vice President International for MapInfo Corp. He began his career in senior management roles at Lotus Development Corp and Xerox Corp – Software Division.

David attended Oxford Brookes University where he studied Finance. David retains strong links within the venture capital investment community.

Resources

– eBook: Fast Data Use Cases for Telecommunications. Ciara Byrne, 2017, O’Reilly Media. (Link to PDF, registration required)

– Fast Data Pipeline Design: Updating Per-Event Decisions by Swapping Tables. July 11, 2017, by John Piekos, VoltDB

– VoltDB Extends Open Source Capabilities for Development of Real-Time Applications · OCTOBER 24, 2017

– New VoltDB Study Reveals Business and Psychological Impact of Waiting · OCTOBER 11, 2017

– VoltDB Accelerates Access to Translytical Database with Enterprise Lab Program · SEPTEMBER 29, 2017

Related Posts

– On Artificial Intelligence and Analytics. Interview with Narendra Mulani. ODBMS Industry Watch, December 8, 2017

– Internet of Things: Safety, Security and Privacy. Interview with Vint G. Cerf. ODBMS Industry Watch, June 11, 2017

Follow us on Twitter: @odbmsorg

##
On Artificial Intelligence and Analytics. Interview with Narendra Mulani. Published Fri, 08 Dec 2017. http://www.odbms.org/blog/2017/12/on-artificial-intelligence-and-analytics-interview-with-narendra-mulani/

“You can’t get good insights from bad data, and AI is playing an instrumental role in the data preparation renaissance.”–Narendra Mulani

I have interviewed Narendra Mulani, chief analytics officer, Accenture Analytics.

RVZ

Q1. What is the role of Artificial Intelligence in analytics?

Narendra Mulani: Artificial Intelligence will be the single greatest change driver of our age. Combined with analytics, it’s redefining what’s possible by unlocking new value from data, changing the way we interact with each other and technology, and improving the way we make decisions. It’s giving us wider control and extending our capabilities as businesses and as people.

AI is also the connector and culmination of many elements of our analytics strategy including data, analytics techniques, platforms and differentiated industry skills.

You can’t get good insights from bad data, and AI is playing an instrumental role in the data preparation renaissance.
AI-powered analytics essentially frees talent to focus on insights rather than data preparation, which becomes more daunting with the sheer volume of data available. It helps organizations tap into new unstructured, contextual data sources like social, video and chat, giving clients a more complete view of their customer. Very recently we acquired Search Technologies, which possesses a unique set of technologies that give ‘context to content’ – whatever its format – and make it quickly accessible to our clients.
As a result, we gain more precise insights on the “why” behind transactions for our clients and can deliver better customer experiences that drive better business outcomes.

Overall, AI-powered analytics will go a long way in allowing the enterprise to find the trapped value that exists in data, discover new opportunities and operate with new agility.

Q2. How can enterprises become ‘data native’ and digital at the core to help them grow and succeed?

Narendra Mulani: It starts with embracing a new culture which we call ‘data native’. You can’t be digital to the core if you don’t embed data at the core. Getting there is no mean feat. The rate of change in technology and data science is exponential, while the rate at which humans can adapt to this change is finite. In order to close the gap, businesses need to democratize data and get new intelligence to the point where it is easily understood and adopted across the organization.
With the help of design-led analytics and app-based delivery, analytics becomes a universal language in the organization, helping employees make data-driven decisions, collaborate across teams and collectively focus efforts on driving improved outcomes for the business.

Enterprises today are only using a small fraction of the data available to them as we have moved from the era of big data to the era of all data. The comprehensive, real-time view businesses can gain of their operations from connected devices is staggering.

But businesses have to get a few things right to ensure they go on this journey.

Understanding and embracing convergence of analytics and artificial intelligence is one of them. You can hardly overstate the impact AI will have on mobilizing and augmenting the value in data, in 2018 and beyond. AI will be the single greatest change driver and will have a lasting effect on how business is conducted.

Enterprises also need to be ready to seize new opportunities – and that means using new data science to help shape hypotheses, test and optimize proofs-of-concept and scale quickly. This will help you reimagine your core business and uncover additional revenue streams and expansion opportunities.

All this requires a new level of agility. To help our clients act and respond fast, we support them with our platforms, our people and our partners. Backed by deep analytics expertise, new cloud-based systems and a curated and powerful alliance and delivery network, our priority is architecting the best solution to meet the needs of each client. We offer an as-a-service engagement model and a suite of intelligent industry solutions that enable even greater agility and speed to market.

Q3. Why is machine learning (ML) such a big deal, where is it driving changes today, and what are the big opportunities for it that have not yet been tapped?

Narendra Mulani: Machine learning allows computers to discover hidden or complex patterns in data without explicit programming. The impact this has on the business is tremendous—it accelerates and augments insights discovery, eliminates tedious repetitive tasks, and essentially enables better outcomes. It can be used to do a lot of good for people, from reading a car’s license plate and forcing the driver to slow down, to allowing people to communicate with others regardless of the language they speak, and helping doctors find very early evidence of cancer.

While the potential we’re seeing for ML and AI in general is vast, businesses are still in the infancy of tapping it. Organizations looking to put AI and ML to use today need to be pragmatic. While it can amplify the quality of insights in many areas, it also increases complexity for organizations, in terms of procuring specialized infrastructure or in identifying and preparing the data to train and use AI, and with validating the results. Identifying the real potential and the challenges involved are areas where most companies today lack the necessary experience and skills and need a trusted advisor or partner.

Whenever we look at the potential AI and ML have, we should also be looking at the responsibility that comes with it. Explainable AI and AI transparency are top of mind for many computer scientists, mathematicians and legal scholars.
These are critical subjects for an ethical application of AI – particularly critical in areas such as financial services, healthcare and life sciences – to ensure that data use is appropriate, and to assess the fairness of derived algorithms.
We need to recognize that, while AI is science, and science is limitless, there are always risks in how that science is used by humans, and proactively identify and address issues this might cause for people and society.

————————————————

Narendra Mulani is Chief Analytics Officer of Accenture Analytics, a practice that his passion and foresight have helped shape since 2012.

A connector at the core, Narendra brings machine learning, data science, data engineers and the business closer together across industries and geographies to embed analytics and create new intelligence, democratize data and foster a data native culture.

He leads a global team of industry and function-specific analytics professionals, data scientists, data engineers, analytics strategy, design and visualization experts across 56 markets to help clients unlock trapped value and define new ways to disrupt in their markets. As a leader, he believes in creating an environment that is inspiring, exciting and innovative.

Narendra takes a thoughtful approach to developing unique analytics strategies and uncovering impactful outcomes. His insight has been shared with business and trade media including Bloomberg, Harvard Business Review, Information Management, CIO magazine, and CIO Insight. Under Narendra’s leadership, Accenture’s commitment and strong momentum in delivering innovative analytics services to clients was recognized in Everest Group’s Analytics Business Process Services PEAK Matrix™ Assessment in 2016.

Narendra joined Accenture in 1997. Prior to assuming his role as Chief Analytics Officer, he was the Managing Director – Products North America, responsible for delivering innovative solutions to clients across industries including consumer goods and services, pharmaceuticals, and automotive. He was also managing director of supply chain for Accenture Management Consulting where he led a global practice responsible for defining and implementing supply chain capabilities at a diverse set of Fortune 500 clients.

Narendra graduated with a Bachelor of Commerce degree at Bombay University, where he was introduced to statistics and discovered he understood probability at a fundamental level that propelled him on his destined career path. He went on to receive an MBA in Finance in 1982 as well as a PhD in 1985 focused on Multivariate Statistics, both from the University of Massachusetts. Education remains fundamentally important to him.

As one who logs too many frequent flier miles, Narendra is an active proponent of taking time for oneself to recharge and stay at the top of your game. He practices what he preaches through early rising and active mindfulness and meditation to keep his focus and balance at work and at home. Narendra is involved with various activities that support education and the arts, and is a music enthusiast. He lives in Connecticut with his wife Nita and two children, Ravi and Nikhil.

Resources

Accenture Invests in and Forms Strategic Alliance with Leading Quantum Computing Firm 1QBit

Accenture Forms Alliance with Paxata to Help Clients Build an Intelligent Enterprise by Putting Business Users in Control of Data

Apple & Accenture Partner to Create iOS Business Solutions

Accenture Completes Cloud-Based IT Transformation for Towergate, Helping Insurance Broker Improve Its Operations and Reduce Annual IT Costs by 30 Percent

Accenture Acquires Search Technologies to Expand Its Content Analytics and Enterprise Search Capabilities

Related Posts

How Algorithms can untangle Human Questions. Interview with Brian Christian. ODBMS Industry Watch, March 31, 2017

Big Data and The Great A.I. Awakening. Interview with Steve Lohr. ODBMS Industry Watch, December 19, 2016

Machines of Loving Grace. Interview with John Markoff. ODBMS Industry Watch, August 11, 2016

On Artificial Intelligence and Society. Interview with Oren Etzioni. ODBMS Industry Watch, January 15, 2016

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2017/12/on-artificial-intelligence-and-analytics-interview-with-narendra-mulani/feed/ 0
On the future of Data Warehousing. Interview with Jacque Istok and Mike Waas http://www.odbms.org/blog/2017/11/on-the-future-of-data-warehousing-interview-with-jacque-istok-and-mike-waas/ http://www.odbms.org/blog/2017/11/on-the-future-of-data-warehousing-interview-with-jacque-istok-and-mike-waas/#comments Thu, 09 Nov 2017 08:54:27 +0000 http://www.odbms.org/blog/?p=4502

“Open source software comes with a promise, and that promise is not about looking at the code, rather it’s about avoiding vendor lock-in.” –Jacque Istok.

“The cloud has out-paced the data center by far and we should expect to see the entire database market being replatformed into the cloud within the next 5-10 years.” –Mike Waas.

I have interviewed Jacque Istok, Head of Data Technical Field for Pivotal, and Mike Waas, founder and CEO of Datometry.
Main topics of the interview are: the future of Data Warehousing, how open source and the Cloud are affecting the Data Warehouse market, and Datometry Hyper-Q and Pivotal Greenplum.

RVZ

Q1. What is the future of Data Warehouses?

Jacque Istok: I believe that what we’re seeing in the market is a slight course correction with regards to the traditional data warehouse. For 25 years many of us spent many cycles building the traditional data warehouse, the single source of the truth. But the long duration it took to get alignment from each of the business units regarding how the data related to each other, combined with the cost of the hardware and software of the platforms we built it upon, left everybody looking for something new. Enter Hadoop, and suddenly the world found out that we could split up data on commodity servers and, with the right human talent, could move the ball forward faster and cheaper. Unfortunately, the right human talent has proved hard to come by, and the plethora of projects that have sprung up are neither production ready nor completely compliant or compatible with the expensive tools they were trying to replace.
So what looks to be happening is the world is looking for the features of yesterday combined with the cost and flexibility of today. In many cases that will be a hybrid solution of many different projects/platforms/applications, or at the very least, something that can interface easily and efficiently with many different projects/platforms/applications.

Mike Waas: Indeed, flexibility is what most enterprises are looking for nowadays when it comes to data warehousing. The business needs to be able to tap data quickly and effectively. However, in today’s world we see an enormous access problem with application stacks that are tightly bonded with the underlying database infrastructure. Instead of maintaining large and carefully curated data silos, data warehousing in the next decade will be all about using analytical applications from a quickly evolving application ecosystem with any and all data sources in the enterprise: in short, any application on any database. I believe data warehouses remain the most valuable of databases, therefore, cracking the access problem there will be hugely important from an economic point of view.

Q2. How is open source affecting the Data Warehouse market?

Jacque Istok: The traditional data warehouse market is having its lunch eaten by open source, whether it’s one of the Hadoop distributions, one of the up-and-coming new NoSQL engines, or companies like Pivotal making large bets on open source, production-proven alternatives like Greenplum. What I ask prospective customers is: if they were starting a new organization today, what platforms, databases, or languages would they choose that weren’t open source? The answer is almost always none. Open source software comes with a promise, and that promise is not about looking at the code, rather it’s about avoiding vendor lock-in.

Mike Waas: Whenever a technology stack gets disrupted by open source, it’s usually a sign that the technology has reached a certain maturity and customers have begun doubting the advantage of proprietary solutions. For the longest time, analytical processing was considered too advanced and too far-reaching in scope for an open source project. Greenplum Database is a great example for breaking through this ceiling: it’s the first open source database system with a query optimizer not only worth that title but setting a new standard, and a whole array of other goodies previously only available in proprietary systems.

Q3. Are databases an obstacle to adopting Cloud-Native Technology?

Jacque Istok: I believe quite the contrary, databases are a requirement for Cloud-Native Technology. Any applications that are created need to leverage data in some way. I think where the technology is going is to make it easier for developers to leverage whichever database or datastore makes the most sense for them or they have the most experience with – essentially leveraging the right tool for the right job, instead of the tool “blessed” by IT or Operations for general use. And they are doing this by automating the day 0, day 1, and day 2 operations of those databases. Making it easy to instantiate and use these platforms for anyone, which has never really been the case.

Mike Waas: In fact, a cloud-first strategy is incomplete unless it includes the data assets, i.e., the databases. Now, databases have always been one of the hardest things to move or replatform, and, naturally, it’s the ultimate challenge when moving to the cloud: firing up any new instance in the cloud is easy as 1-2-3, but what to do with the tens of years of investment in application development? I would say it’s actually not the database that’s the obstacle but the applications and their dependencies.

Q4. What are the pros and cons of moving enterprise data to the cloud?

Jacque Istok: I think there are plenty of pros to moving enterprise data to the cloud, the extent of that list will really depend on the enterprise you’re talking to and the vertical that they are in. But cons? The only cons would be using these incredible tools incorrectly, at which point you might find yourself spending more money and feeling that things are slower or less flexible. Treating the cloud as a virtual data center, and simply moving things there without changing how they are architected or how they are used would be akin to taking

Mike Waas: I second that. A few years ago enterprises were still concerned about security, completeness of offering, and maturity of the stack. But now, the cloud has out-paced the data center by far and we should expect to see the entire database market being replatformed into the cloud within the next 5-10 years. This is going to be the biggest revolution in the database industry since the relational model with great opportunities for vendors and customers alike.

Q5. How do you quantify when it is appropriate for an enterprise to move their data management to a new platform?

Jacque Istok: It’s pretty easy from my perspective: when any enterprise is done spending exorbitant amounts of money it might be time to move to a new platform. When you are coming up on a renewal or an upgrade of a legacy and/or expensive system it might be time to move to a new platform. When you have new initiatives to start it might be time to move to a new platform. When you are ready to compete with your competitors, both known and unknown (aka startups), it might be time to move to a new platform. The move doesn’t have to be scary either, as some products are designed to be a bridge to a modern data platform.

Mike Waas: Traditionally, enterprises have held off from replatforming for too long: the switching cost has deterred them from adopting new and highly superior technology with the result that they have been unable to cut costs or gain true competitive advantage. Staying on an old platform is simply bad for business. Every organization needs to ask themselves constantly the question whether their business can benefit from adopting new technology. At Datometry, we make it easy for enterprises to move their analytics — so easy, in fact, the standard reaction to our technology is, “this is too good to be true.”

Q6. What is the biggest problem when enterprises want to move part or all of their data management to the cloud?

Jacque Istok: I think the biggest problem tends to be not architecting for the cloud itself, but instead treating the cloud like their virtual data center. Leveraging the same techniques, the same processes, and the same architectures will not lead to the cost or scalability efficiencies that you were hoping for.

Mike Waas: As Jacque points out, you really need to change your approach. However, the temptation is to use the move to the cloud as a trigger event to rework everything else at the same time. This quickly leads to projects that spiral out of control, run long, go over budget, or fail altogether. Being able to replatform quickly and separate the housekeeping from the actual move is, therefore, critical.
However, when it comes to databases, trouble runs deeper as applications and their dependencies on specific databases are the biggest obstacle. SQL code is embedded in thousands of applications and, probably most surprising, even third-party products that promise portability between databases get naturally contaminated with system-specific configuration and SQL extensions. We see roughly 90% of third-party systems (ETL, BI tools, and so forth) having been so customized to the underlying database that moving them to a different system requires substantial effort, time, and money.

Q7. How does an enterprise move the data management to a new platform without having to re-write all of the applications that rely on the database?

Mike Waas: At Datometry, we looked very carefully at this problem and, with what I said above, identified the need to rewrite applications each time new technology is adopted as the number one problem in the modern enterprise. Using Adaptive Data Virtualization (ADV) technology, this will quickly become a problem of the past! Systems like Datometry Hyper-Q let existing applications run natively and instantly on a new database without requiring any changes to the application. What would otherwise be a multi-year migration project and run into the millions, is now reduced in time, cost, and risk to a fraction of the conventional approach. “VMware for databases” is a great mental model that has worked really well for our customers.

Q8. What is Adaptive Data Virtualization technology, and how can it help adopting Cloud-Native Technology?

Mike Waas: Adaptive Data Virtualization is the simple, yet incredibly powerful, abstraction of a database: by intercepting the communication between application and database, ADV is able to translate in real-time and dynamically between the existing application and the new database. With ADV, we are drawing on decades of database research and solving what is essentially a compatibility problem between programming languages and systems with an elegant and highly effective approach. This is a space that has traditionally been served by consultants and manual migrations, which are incredibly labor-intensive and expensive undertakings.
Through ADV, adopting cloud technology becomes orders of magnitude simpler as it takes away the compatibility challenges that hamper any replatforming initiative.
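
The interception idea described above can be illustrated with a toy sketch. This is not how Datometry Hyper-Q is implemented; it only assumes a thin layer that receives the application's original SQL, rewrites dialect-specific shorthand (here the Teradata-style abbreviations SEL, INS, UPD and DEL), and forwards the result to a different database. The names `translate`, `VirtualizedConnection` and `demo` are hypothetical, and an in-memory SQLite database stands in for the new platform.

```python
import re
import sqlite3

# Illustrative mapping of dialect shorthands to standard SQL keywords.
# A real translation layer would handle far more than leading keywords.
KEYWORD_MAP = {"SEL": "SELECT", "INS": "INSERT", "UPD": "UPDATE", "DEL": "DELETE"}

_LEADING_KEYWORD = re.compile(r"^\s*([A-Za-z]+)\b")


def translate(statement: str) -> str:
    """Rewrite a leading dialect-specific keyword into its standard form."""
    match = _LEADING_KEYWORD.match(statement)
    if not match:
        return statement
    replacement = KEYWORD_MAP.get(match.group(1).upper())
    if replacement is None:
        return statement  # already standard, pass through untouched
    return statement[: match.start(1)] + replacement + statement[match.end(1):]


class VirtualizedConnection:
    """Hypothetical layer between the application and the target database.

    The application keeps sending its original SQL; each statement is
    translated on the fly before it reaches the new database.
    """

    def __init__(self, target: sqlite3.Connection) -> None:
        self._target = target

    def execute(self, statement: str, params: tuple = ()) -> list:
        return self._target.execute(translate(statement), params).fetchall()


def demo() -> list:
    target = sqlite3.connect(":memory:")  # stands in for the new database
    target.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    target.execute("INSERT INTO orders VALUES (1, 9.5), (2, 20.0)")
    conn = VirtualizedConnection(target)
    # The unchanged "legacy" query still uses the SEL shorthand.
    return conn.execute("SEL id FROM orders WHERE amount > ? ORDER BY id", (10,))
```

The application code calling `conn.execute` never changes; only the layer in between knows that the target database expects a different dialect.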

Q9. Can you quantify what are the reduced time, cost, and risk when virtualizing the data warehouse?

Jacque Istok: In the past, virtualizing the data warehouse meant sacrificing performance in order to get some of the common benefits of virtualization (reduced time for experimentation, maximizing resources, relative ease to readjust the architecture, etc). What we have found recently is that virtualization, when done correctly, actually provides no sacrifices in terms of performance, and the only question becomes whether or not the capital cost expenditure of bare metal versus the opex cost structure of virtual is something that makes sense for your organisation.

Mike Waas: I’d like to take it a step further and include ADV into this context too: instead of a 3-5 year migration, employing 100+ consultants, and rewriting millions of lines of application code, ADV lets you leverage new technology in weeks, with no re-writing of applications. Our customers can expect to save at least 85% of the transition cost.

Q10. What is the massively parallel processing (MPP) Scatter/Gather Streaming™ technology, and what is it useful for?

Jacque Istok: This is arguably one of the most powerful features of Pivotal Greenplum and it allows for the fastest loading of data in the industry. Effectively we scatter data into the Greenplum data cluster as fast as possible with no care in the world to where it will ultimately end up. Terabytes of data per hour, basically as much as you can feed down the wires, is sent to each of the workers within the cluster. The data is therefore disseminated to the cluster in the fastest physical way possible. At that point, each of the workers gathers the data that is pertinent to them according to the architecture you have chosen for the layout of those particular data elements, allowing for a physical optimization to be leveraged during interrogation of the data after it has been loaded.
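
The two phases described in this answer can be simulated in a few lines. This is only a rough sketch under stated assumptions, not Greenplum's actual loader: rows are first scattered round-robin to workers with no regard for their final location, then each row is gathered by the worker that owns its distribution key. The function names, the `customer_id` key and the use of a CRC32 hash are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def scatter(rows, num_segments):
    """Phase 1: hand rows to segments round-robin, ignoring their final home."""
    landed = defaultdict(list)
    for i, row in enumerate(rows):
        landed[i % num_segments].append(row)
    return landed


def home_segment(key, num_segments):
    """Pick a row's final segment from a stable hash of its distribution key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def gather(landed, num_segments, key_field):
    """Phase 2: each segment collects the rows whose key hashes to it."""
    placed = defaultdict(list)
    for rows in landed.values():
        for row in rows:
            placed[home_segment(row[key_field], num_segments)].append(row)
    return placed


def load(rows, num_segments=4, key_field="customer_id"):
    """Run both phases and return the final segment -> rows placement."""
    return gather(scatter(rows, num_segments), num_segments, key_field)
```

After `load`, all rows that share a distribution key sit on the same segment, which is the physical layout that later queries on that key can exploit.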

Q11. How do Datometry Hyper-Q and the Pivotal Greenplum data warehouse work together?

Jacque Istok: Pivotal Greenplum is the world’s only true open source, production proven MPP data platform that provides out of the box ANSI compliant SQL capabilities along with Machine Learning, AI, Graph, Text, and Spatial analytics all in one. When combined with Datometry Hyper-Q, you can transparently and seamlessly take any Teradata application and, without changing a single line of code or a single piece of SQL, run it and stop paying the outrageous Teradata tax that you have been bearing all this time. Once you’re able to take out your legacy and expensive Teradata system, without a long investment to rewrite anything, you’ll be able to leverage this software platform to really start to analyze the data you have. And that analysis can be either on premise or in the cloud, giving you a truly hybrid and cross-cloud proven platform.

Mike Waas: I’d like to share a use case for Datometry Hyper-Q and Pivotal Greenplum involving a Fortune 100 global financial institution that needed to scale its business intelligence application, built using 2000-plus stored procedures. The customer’s analysis showed that replacing their existing data warehouse footprint was prohibitively expensive and rewriting the business applications to a more cost-effective and modern data warehouse posed significant expense and business risk. Hyper-Q allowed the customer to transfer the stored procedures in days without refactoring the logic of the application or implementing various control-flow primitives, which would have been a time-consuming and expensive proposition.

Qx. Anything else you wish to add?

Jacque Istok: Thank you for the opportunity to speak with you. We have found that there has never been a more valid time than right now for customers to stop paying their heavy Teradata tax and the combination of Pivotal Greenplum and Datometry Hyper-Q allows them to do that right now, with no risk, and immediate ROI. On top of that, they are then able to find themselves on a modern data platform – one that allows them to grow into more advanced features as they are able. Pivotal Greenplum becomes their bridge to transforming their organization by offering the advanced analytics they need but giving them traditional, production proven capabilities immediately. At the end of the day, there isn’t a single Teradata customer that I’ve spoken to that doesn’t want Teradata-like capabilities at Hadoop-like prices and you get all this and more with Pivotal Greenplum.

Mike Waas: Thank you for this great opportunity to speak with you. We, at Datometry, believe that data is the key that will unlock competitive advantage for enterprises and without adopting modern data management technologies, it is not possible to unlock value. According to the leading industry group, TDWI, “today’s consensus says that the primary path to big data’s business value is through the use of so-called ‘advanced’ forms of analytics based on technologies for mining, predictions, statistics, and natural language processing (NLP). Each analytic technology has unique data requirements, and DWs must modernize to satisfy all of them.”
We believe virtualizing the data warehouse is the cornerstone of any cloud-first strategy because data warehouse migration is one of the most risk-laden and most expensive initiatives that a company can embark on during their journey to the cloud.
Interestingly, the cost of migration is primarily the cost of process and not technology and this is where Datometry comes in with its data warehouse virtualization technology.
We are the key that unlocks the power of new technology for enterprises to take advantage of the latest technology and gain competitive advantage.

———————
Jacque Istok serves as the Head of Data Technical Field for Pivotal, responsible for setting both data strategy and execution of pre and post sales activities for data engineering and data science. Prior to that, he was Field CTO helping customers architect and understand how the entire Pivotal portfolio could be leveraged appropriately.
A hands on technologist, Mr. Istok has been implementing and advising customers in the architecture of big data applications and back end infrastructure the majority of his career.

Prior to Pivotal, Mr. Istok co-founded Professional Innovations, Inc. in 1999, a leading consulting services provider in the business intelligence, data warehousing, and enterprise performance management space, and served as its President and Chairman. Mr. Istok is on the board of several emerging startup companies and serves as their strategic technical advisor.

Mike Waas, CEO Datometry, Inc.
Mike Waas founded Datometry after having spent over 20 years in database research and commercial database development. Prior to Datometry, Mike was Sr. Director of Engineering at Pivotal, heading up Greenplum’s Advanced R&D team. He is also the founder and architect of Greenplum’s ORCA query optimizer initiative. Mike has held senior engineering positions at Microsoft, Amazon, Greenplum, EMC, and Pivotal, and was a researcher at Centrum voor Wiskunde en Informatica (CWI), Netherlands, and at Humboldt University, Berlin.

Mike received his M.S. in Computer Science from University of Passau, Germany, and his Ph.D. in Computer Science from the University of Amsterdam, Netherlands. He has authored or co-authored 36 publications on the science of databases and has 24 patents to his credit.

Resources

– Datometry Releases Hyper-Q Data Warehouse Virtualization Software Version 3.0. AUGUST 11, 2017

– Replatforming Custom Business Intelligence | Use Case. ODBMS.org · NOVEMBER 7, 2017

– Disaster Recovery Cloud Data Warehouse | Use Case. ODBMS.org · NOVEMBER 3, 2017

– Scaling Business Intelligence in the Cloud | Use Case. ODBMS.org · NOVEMBER 3, 2017

– Re-Platforming Data Warehouses – Without Costly Migration Of Applications. ODBMS.org · NOVEMBER 3, 2017

– Meet Greenplum 5: The World’s First Open-Source, Multi-Cloud Data Platform Built for Advanced Analytics. ODBMS.org · SEPTEMBER 21, 2017

Related Posts

– On Open Source Databases. Interview with Peter Zaitsev, ODBMS Industry Watch, Published on 2017-09-06

– On Apache Ignite, Apache Spark and MySQL. Interview with Nikita Ivanov, ODBMS Industry Watch, Published on 2017-06-30

– On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah, ODBMS Industry Watch, Published on 2017-03-13

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2017/11/on-the-future-of-data-warehousing-interview-with-jacque-istok-and-mike-waas/feed/ 0
On Vertica and the new combined Micro Focus company. Interview with Colin Mahony http://www.odbms.org/blog/2017/10/on-vertica-and-the-new-combined-micro-focus-company-interview-with-colin-mahony/ http://www.odbms.org/blog/2017/10/on-vertica-and-the-new-combined-micro-focus-company-interview-with-colin-mahony/#comments Wed, 25 Oct 2017 09:25:58 +0000 http://www.odbms.org/blog/?p=4489

“There has been no uncertainty with respect to the Micro Focus leadership’s commitment to building on the great brand and product we have developed at Vertica.” – Colin Mahony

I have interviewed Colin Mahony, SVP & General Manager, Vertica Product Group, Micro Focus.
In this interview we covered the recent spin-off of HPE software into a new combined Micro Focus company, and how this is affecting Vertica. We also covered the new release of Vertica 9, and the importance of Big Data analytics.

RVZ

Q1. With the recent spin-off of HPE software into a new, combined Micro Focus company, do you see things changing for Vertica?

Colin Mahony: From a product development, sales and customer support perspective – it’s been business as usual at Vertica leading up to and since the spin-merge with Micro Focus. Our focus, as always, is to build the best possible product and deliver world-class support for our growing customer base. That won’t change any time soon.

The biggest change I see post spin-merge is that Vertica is now part of a pure-play software company, rather than a business where a majority of revenue comes from hardware. Running a software company is a lot different than running a hardware business. Under HPE, the software assets sometimes struggled in establishing their own identity as part of a much larger hardware business. Micro Focus, on the other hand, is designed from the ground up to build, sell and support software for our customers; that’s all we do. The new, combined Micro Focus is the 7th largest pure-play software company in the world, and we have the global scale to be an industry shaper.
But maybe even more exciting is the level of support and GTM independence that we are already seeing from Micro Focus in support of Vertica. You have likely seen Vertica’s logo and you’ll continue to see more of that, especially on the Vertica.com website that we launched in February and that already has almost 1 million page views! We have been structured uniquely in the new Micro Focus and this gives me complete confidence in our future. I’m genuinely excited about the opportunity to be in a business that is dedicated and focused purely on software – especially software with analytics built in, the new Micro Focus company mission – and the business value of that software for customers.

Q2. There are concerns that Micro Focus may end up managing mature software assets of HPE and extending their shelf life, rather than actively investing in feature developments. What is your take on this?

Colin Mahony: I fundamentally disagree with that. Micro Focus helps companies bridge their existing technologies with new infrastructure and applications. It helps them maximize their ROI while embracing innovation to address the opportunities of the new Hybrid IT and analytics-driven environment. It’s frankly wrong to expect customers to make investments in core technologies without working hard to maximize the investment in those technologies. Over the years, Micro Focus has taken core assets and made them modern, delivering significant value to the company and our customers.

It’s also important to note that the new, combined Micro Focus has an incredible depth and breadth of software assets in its portfolio – covering DevOps, IT Operations, Cloud, Security, Big Data and more – not all of which are mature products.
Take SUSE for instance, a Micro Focus product and the fastest growing open Linux platform. I’m very impressed with the approach that Micro Focus has on supporting growth businesses like this. I have the very same expectations for our Vertica business, especially because this is a massive new opportunity for Micro Focus, which prior to the spin-merge did not have a Big Data offering.
This means no confusion, no duplication of resources, and a lot of potential because we know that every company in virtually every industry is thinking about how to leverage analytics at the core of everything they do, and again, why “analytics built in” is at the core of the new company’s mission.

Q3. Will Micro Focus continue to develop Vertica?

Colin Mahony: There has been no uncertainty with respect to the Micro Focus leadership’s commitment to building on the great brand and product we have developed at Vertica. Since the spin-merge with Micro Focus was first announced in 2016, we have actually been reinvigorating the Vertica brand name, all based on the recognition that Micro Focus has a tremendous market opportunity in front of it with the advent of Big Data and the growing importance all companies are placing on the value of analytics. You can see this commitment with the build-out of our new website, www.vertica.com, our presence at industry trade shows and conferences, and more.

In a recent interview, Chris Hsu, CEO of the new, combined Micro Focus, expressed his commitment to big data analytics – and specifically Vertica – as the number one area he is most excited to focus on and grow within the portfolio. It’s an exciting time to be part of Vertica. We have an incredible opportunity in front of us.

Q4. Micro Focus now has a number of software assets covering Hybrid IT, DevOps, Security and more, where analytics is critical. Does or will Vertica play a role in those products?

Colin Mahony: Absolutely. Not only is there a strong commitment in continuing to develop Vertica as a product and brand, there’s wide recognition within Micro Focus that predictive analytics is critical for the success of data-centric enterprises, and therefore a critical component to the breadth of assets in our own portfolio.

Vertica is an ideal solution for embedded analytics. Businesses that embed Vertica stand out from the competition and deliver higher value to customers. Specifically designed for analytic workloads, Vertica’s speed and performance, advanced analytics, ease of deployment, and support for data scientists make it tailor-made for embedding. We now have an opportunity to embed these great analytical features in a range of Micro Focus software assets, something we’ve already begun to do in application delivery management, IT operations and security. As I’ve said, a core part of our company’s mission moving forward is to provide customers with enterprise-grade scalable software with analytics built in. I see this as a large and growing opportunity for innovation here at Micro Focus.

Q5. You recently released Vertica Version 9, with major enhancements in cloud deployments and separation of compute and storage. Are these common themes for Vertica moving forward?

Colin Mahony: They are. Vertica has always been 100% committed to helping our customers deploy advanced analytics free from underlying infrastructure and hardware lock-in. We’ve seen that legacy data warehouse solutions have forced many enterprises into rigid and high-cost proprietary hardware and analytics solutions supporting only limited data formats and deployment options. As data formats and storage locations continuously evolve, organizations require a powerful and unified solution to analyze data in the right place at the right time, with the performance and economics that the business requires. Our continued commitment to this principle – and our support for any major cloud platform, whether AWS, Azure or GCP – is foundational to Vertica’s core.

Separation of compute and storage is a logical extension of this product development ethos. Vertica’s beta release of its new Eon Mode architecture, offering separation of compute and storage, provides rapid elastic scaling up and down of the Vertica cluster, with just-in-time workload-based provisioning.
An intelligent, new caching mechanism on the nodes enables organizations to benefit from Vertica’s industry-leading query performance. Companies in the AWS ecosystem will be able to leverage AWS S3 for storage and Vertica’s query-optimized analytics engine for processing speed to capitalize on cloud economics.

You can expect continued product development and investment in these areas.

Q6. With the explosion of data lakes and other external data storage (including Hadoop, AWS S3, etc.), does this complicate the analytical database market or change the dynamics of how and where you analyze data?

Colin Mahony: It certainly changes the big data landscape. Hadoop has been a boon to companies and organizations that want to store vast new volumes of unstructured data cheaply in the form of a data lake. AWS S3 has extended that cheap storage to the cloud. Although Hadoop stores massive volumes of unstructured data, performing analytics on Hadoop proved challenging. Despite this challenge, companies did not want to move large amounts of data in and out of their Hadoop data lakes. As a result, more and more companies were looking to build out enterprise-grade SQL analytics on top of their Hadoop investments. This created a tremendous opportunity for Vertica, and Vertica for SQL on Hadoop was born. Vertica SQL on Hadoop is the same binary, the same core engine, with the ability to deploy natively on Hadoop nodes. Since then, we’ve continued to innovate on how Vertica integrates with the various Hadoop distributions and file formats. We’ve leveraged our years of experience in the Big Data analytics marketplace to enable organizations to analyze their data not only in place, but in the right place – without data movement – while supporting any major cloud deployment for fast and reliable read and write for multiple data formats.

Starting with the release of Vertica 8, users could derive more value from their Hadoop data lakes with Vertica’s high-performance Parquet and ORC Readers that enable users to securely access and analyze data that resides in Hadoop data lakes without copying or moving the data. And now with our latest Vertica 9 release, we’ve introduced a new HDFS Parquet writer – built on Vertica’s fast and reliable ability to not only read, but now write data and results on HDFS – to derive and contribute immediate insights on growing data lakes. Organizations can use Vertica 9’s flexible and expanded deployment options across on-premise, private, and public clouds, and on Hadoop and AWS S3 data lakes, to adopt a best-fit analytical solution.

The days of having to move data in and out of various databases and data lakes are coming to an end. In the future, more and more companies will bring analytics to the data, analyzing it in place. We believe Vertica is working at the forefront of this market transformation.

Q7. Over the last few releases, Vertica has made significant advancements in the area of in-database machine learning. How do you see this set of capabilities contributing to Vertica’s strategy and the success of your customers?

Colin Mahony: There’s no doubt that machine learning and predictive analytics are, and will continue to be, a core differentiator for organizations. In today’s data-driven world, creating a competitive advantage depends on your ability to transform massive volumes of data into meaningful insights. Vertica has always supported the world’s leading data-driven organizations with the fastest SQL and extended SQL analytics. And now, by building machine learning functions directly into Vertica’s core, with no need to download and install separate packages, we are transforming the way data scientists and analysts across industries interact with data, removing barriers and accelerating time to value on predictive analytics projects. And it’s not just about developing the right algorithms and models. Our goal at Vertica is to support the entire machine learning and predictive analytics process, from data preparation to model evaluation and deployment – all using Vertica’s industry-leading scalability and performance. I’m incredibly excited to see these features transform data science and predictive analytics projects within our customer base, and for this reason, in-database machine learning will play a major role in Vertica’s future, and the future of our customers.

Our commitment to this area can be seen in the latest Vertica 9 release, which provides a comprehensive set of new Machine Learning algorithms for categorization, overfitting and prediction to enhance processing speed by eliminating the need for down-sampling and data movement. There’s also support for new data-preparation functions for deriving greater meaning from the data, while improving the quality of analysis, and a streamlined end-to-end workflow that simplifies production deployment of Machine Learning models – particularly for customers that embed Vertica and require the ability to replicate models across clusters.

————————

Colin Mahony, SVP & General Manager, Vertica Product Group, Micro Focus

Colin Mahony leads the Vertica Product Group for Micro Focus, helping the world’s most data-driven organizations to leverage and monetize their business data. Vertica was founded in 2005 and is one of the industry’s fastest growing advanced analytics platforms, with in-database machine learning, the ability to analyze data in the right place, and freedom from underlying infrastructure. Micro Focus also leverages Vertica to deliver embedded analytics across a very broad portfolio of enterprise-grade software.

In 2011, Colin joined Hewlett Packard as part of the highly successful acquisition of Vertica, and took on the responsibility of VP and General Manager for HP Vertica, where he guided the business to remarkable annual growth and recognized industry leadership. Colin brings a unique combination of technical knowledge, market intelligence, customer relationships, and strategic partnerships to one of the fastest growing and most exciting segments of HP Software.

Prior to Vertica, Colin was a Vice President at Bessemer Venture Partners focused on investments primarily in enterprise software, telecommunications, and digital media. He established a great network and reputation for assisting in the creation and ongoing operations of companies through his knowledge of technology, markets and general management in both small startups and larger companies. Prior to Bessemer, Colin worked at Lazard Technology Partners in a similar investor capacity.

Prior to his venture capital experience, Colin was a Senior Analyst at the Yankee Group serving as an industry analyst and consultant covering databases, BI, middleware, application servers and ERP systems. Colin helped build the ERP and Internet Computing Strategies practice at Yankee in the late nineties.

Colin earned an M.B.A. from Harvard Business School and a bachelor’s degree in Economics with a minor in Computer Science from Georgetown University. He is an active volunteer with Big Brothers Big Sisters of Massachusetts Bay and the Joey Fund for Cystic Fibrosis as well as a mentor and board member of Year Up Boston.

————–

Resources

– What’s New in Vertica 9.0?, ODBMS.org, 22 Oct, 2017

– What’s New in Vertica 9.0: Eon Mode Beta, ODBMS.org, 22 Oct, 2017

– Vertica Version 9.0, ODBMS.org, 22 Oct, 2017

– Micro Focus Introduces Vertica 9, ODBMS.org, Sept. 27, 2017

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2017/10/on-vertica-and-the-new-combined-micro-focus-company-interview-with-colin-mahony/feed/ 0
On Apache Ignite, Apache Spark and MySQL. Interview with Nikita Ivanov http://www.odbms.org/blog/2017/06/on-apache-ignite-apache-spark-and-mysql-interview-with-nikita-ivanov/ http://www.odbms.org/blog/2017/06/on-apache-ignite-apache-spark-and-mysql-interview-with-nikita-ivanov/#comments Fri, 30 Jun 2017 13:40:51 +0000 http://www.odbms.org/blog/?p=4369

“Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.”–Nikita Ivanov.

I have interviewed Nikita Ivanov, CTO of GridGain.
Main topics of the interview are Apache Ignite, Apache Spark and MySQL, and how well they perform on big data analytics.

RVZ

Q1. What are the main technical challenges of SaaS development projects?

Nikita Ivanov: SaaS requires that the applications be highly responsive, reliable and web-scale. SaaS development projects face many of the same challenges as software development projects including a need for stability, reliability, security, scalability, and speed. Speed is especially critical for modern businesses undergoing the digital transformation to deliver real-time services to their end users. These challenges are amplified for SaaS solutions which may have hundreds, thousands, or tens of thousands of concurrent users, far more than an on-premise deployment of enterprise software.
Fortunately, in-memory computing offers SaaS developers solutions to the challenges of speed, scale and reliability.

Q2. In your opinion, what are the limitations of MySQL® when it comes to big data analytics?

Nikita Ivanov: MySQL was originally designed as a single-node system and not with the modern data center concept in mind. MySQL installations cannot scale to accommodate big data using MySQL on a single node. Instead, MySQL must rely on sharding, or splitting a data set over multiple nodes or instances, to manage large data sets. However, most companies manually shard their database, making the creation and maintenance of their application much more complex. Manually creating an application that can then perform cross-node SQL queries on the sharded data multiplies the level of complexity and cost.

MySQL was also not designed to run complicated queries against massive data sets. The MySQL optimizer is quite limited, executing a single query at a time using a single thread. A MySQL query can neither scale among multiple CPU cores in a single system nor execute distributed queries across multiple nodes.
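
To make the manual sharding burden described above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a recommended design: in-memory SQLite databases stand in for separate MySQL instances, and the table, the hash-based routing rule and the function names are all hypothetical.

```python
import sqlite3
import zlib

# Assumption: each in-memory SQLite database stands in for one MySQL instance.
NUM_SHARDS = 3
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")


def shard_for(customer_id: int) -> sqlite3.Connection:
    """Route a row to a shard using a stable hash of the sharding key."""
    return shards[zlib.crc32(str(customer_id).encode()) % NUM_SHARDS]


def insert_order(customer_id: int, amount: float) -> None:
    """Single-key write: the application picks the shard itself."""
    shard_for(customer_id).execute(
        "INSERT INTO orders VALUES (?, ?)", (customer_id, amount)
    )


def total_revenue() -> float:
    """Cross-shard query: scatter to every shard, then merge in the application."""
    partials = [
        db.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]
        for db in shards
    ]
    return sum(partials)


for cid, amount in [(1, 10.0), (2, 25.5), (3, 4.5), (1, 60.0)]:
    insert_order(cid, amount)
```

Even in this toy form, the routing rule and the scatter-and-merge logic live in application code, which is the added complexity the answer refers to.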

Q3. What solutions exist to enhance MySQL’s capabilities for big data analytics?

Nikita Ivanov: Companies that require real-time analytics may attempt to manually shard their database. Tools such as Vitess, a framework YouTube released for MySQL sharding, or ProxySQL are often used to help implement sharding.
To speed up queries, caching solutions such as Memcached and Redis are often deployed.

Many companies turn to data warehousing technologies. These solutions require ETL processes and a separate technology stack which must be deployed and managed. There are many external solutions, such as Hadoop and Apache Spark, which are quite popular. Vertica and ClickHouse have also emerged as analytics solutions for MySQL.

Apache Ignite offers speed, scale and reliability because it was built from the ground up as a high-performance and highly scalable distributed in-memory computing platform.
In contrast to the MySQL single-node design, Apache Ignite automatically distributes data across nodes in a cluster, eliminating the need for manual sharding. The cluster can be deployed on-premise, in the cloud, or in a hybrid environment. Apache Ignite easily integrates with Hadoop and Spark, using in-memory technology to complement these technologies and achieve significantly better performance and scale. The Apache Ignite In-Memory SQL Grid is highly optimized and easily tuned to execute high performance ANSI SQL-99 queries. The In-Memory SQL Grid offers access via JDBC/ODBC and the Ignite SQL API for external SQL commands or integration with analytics visualization software such as Tableau.
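
As a rough illustration of how a cluster can place data on nodes without manual sharding, the sketch below uses rendezvous (highest-random-weight) hashing, one well-known technique for this. It is a simplified Python model with hypothetical node and key names, and it makes no claim to reproduce Ignite's internal implementation.

```python
import hashlib


def owner(key: str, nodes: list[str]) -> str:
    """Return the node with the highest hash score for this key."""

    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(nodes, key=score)


keys = [f"key-{i}" for i in range(1000)]
before = {k: owner(k, ["node-a", "node-b", "node-c"]) for k in keys}
after = {k: owner(k, ["node-a", "node-b", "node-c", "node-d"]) for k in keys}

# Keys whose owner changed after node-d joined the cluster.
moved = [k for k in keys if before[k] != after[k]]
```

With this scheme a key changes owner only if the new node wins it, so adding a node relocates a fraction of the data rather than reshuffling everything.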

Q4. What exactly is Apache® Ignite™?

Nikita Ivanov: Apache Ignite is a high-performance, distributed in-memory platform for computing and transacting on large-scale data sets in real-time. It is 1,000x faster than systems built using traditional database technologies that are based on disk or flash technologies. It can also scale out to manage petabytes of data in memory.

Apache Ignite includes the following functionality:

· Data grid – An in-memory key value data cache that can be queried

· SQL grid – Provides the ability to interact with data in-memory using ANSI SQL-99 via JDBC or ODBC APIs

· Compute grid – A stateless grid that provides high-performance computation in memory using clusters of computers and massive parallel processing

· Service grid – A service grid in which grid service instances are deployed across the distributed data and compute grids

· Streaming analytics – The ability to consume an endless stream of information and process it in real-time

· Advanced clustering – The ability to automatically discover nodes, eliminating the need to restart the entire cluster when adding new nodes

Q5. How does Apache Ignite differ from other in-memory data platforms?

Nikita Ivanov: Most in-memory computing solutions fall into one of three types: in-memory data grids, in-memory databases, or streaming analytics engines.
Apache Ignite is a full-featured in-memory computing platform which includes an in-memory data grid, in-memory database capabilities, and a streaming analytics engine. Furthermore, Apache Ignite supports distributed ACID compliant transactions and ANSI SQL-99 including support for DML and DDL via JDBC/ODBC.

Q6. Can you use Apache® Ignite™ for Real-Time Processing of IoT-Generated Streaming Data?

Nikita Ivanov: Yes, Apache Ignite can ingest and analyze streaming data using its streaming analytics engine which is built on a high-performance and scalable distributed architecture. Because Apache Ignite natively integrates with Apache Spark, it is also possible to deploy Spark for machine learning at in-memory computing speeds.
Apache Ignite supports both high volume OLTP and OLAP use cases, supporting Hybrid Transactional Analytical Processing (HTAP) use cases, while achieving performance gains of 1000x or greater over systems which are built on disk-based databases.

Q7. How do you stream data to an Apache Ignite cluster from embedded devices?

Nikita Ivanov: It is very easy to stream data to an Apache Ignite cluster from embedded devices.
The Apache Ignite streaming functionality allows for processing never-ending streams of data from embedded devices in a scalable and fault-tolerant manner. Apache Ignite can handle millions of events per second on a moderately sized cluster for embedded devices generating massive amounts of data.

Q8. Is this different than using Apache Kafka?

Nikita Ivanov: Apache Kafka is a distributed streaming platform that lets you publish and subscribe to data streams. Kafka is most commonly used to build a real-time streaming data pipeline that reliably transfers data between applications. This is very different from Apache Ignite, which is designed to ingest, process, analyze and store streaming data.

Q9. How do you conduct real-time data processing on this stream using Apache Ignite?

Nikita Ivanov: Apache Ignite includes a connector for Apache Kafka so it is easy to connect Apache Kafka and Apache Ignite. Developers can either push data from Kafka directly into Ignite’s in-memory data cache or present the streaming data to Ignite’s streaming module where it can be analyzed and processed before being stored in memory.
This versatility makes the combination of Apache Kafka and Apache Ignite very powerful for real-time processing of streaming data.
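
The two ingestion paths described in this answer can be sketched in plain Python. Everything here is a hypothetical stand-in chosen so the example runs on its own: a list plays the role of the Kafka topic, dictionaries play the role of in-memory caches, and the sliding-window mean is just one example of processing applied before storage. No Kafka or Ignite API is used.

```python
from collections import deque
from statistics import mean

# Hypothetical stand-ins: a list for the topic, dicts for the caches.
topic = [("sensor-1", 20.0), ("sensor-1", 22.0), ("sensor-2", 5.0), ("sensor-1", 24.0)]

raw_cache = {}
avg_cache = {}
windows = {}


def ingest_direct(events):
    """Path 1: push each event straight into the cache (latest value per key)."""
    for key, value in events:
        raw_cache[key] = value


def ingest_processed(events, window_size=2):
    """Path 2: process first (sliding-window mean per key), then store the result."""
    for key, value in events:
        window = windows.setdefault(key, deque(maxlen=window_size))
        window.append(value)
        avg_cache[key] = mean(window)


ingest_direct(topic)
ingest_processed(topic)
```

The first path keeps the raw latest reading per key, while the second stores a derived value, which is the distinction the answer draws between loading data directly and analyzing it before it lands in memory.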

Q10. Is this different than using Spark Streaming?

Nikita Ivanov: Spark Streaming enables processing of live data streams. This is merely one of the capabilities that Apache Ignite supports. Although Apache Spark and Apache Ignite utilize the power of in-memory computing, they address different use cases. Spark processes but doesn’t store data. It loads the data, processes it, then discards it. Ignite, on the other hand, can be used to process data and it also provides a distributed in-memory key-value store with ACID compliant transactions and SQL support.
Spark is also for non-transactional, read-only data while Ignite supports non-transactional and transactional workloads. Finally, Apache Ignite also supports purely computational payloads for HPC and MPP use cases while Spark works only on data-driven payloads.

Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.

Qx. Is there anything else you wish to add?

Nikita Ivanov: The world is undergoing a digital transformation which is driving companies to get closer to their customers. This transformation requires that companies move from big data to fast data, the ability to gain real-time insights from massive amounts of incoming data. Whether that data is generated by the Internet of Things (IoT), web-scale applications, or other streaming data sources, companies must put architectures in place to make sense of this river of data. As companies make this transition, they will be moving to memory-first architectures which ingest and process data in-memory before offloading to disk-based datastores and increasingly will be applying machine learning and deep learning to understand the data. Apache Ignite continues to evolve in directions that will support and extend the abilities of memory-first architectures and machine learning/deep learning systems.

——–
Nikita Ivanov, Founder & CTO, GridGain
Nikita Ivanov is the founder of the Apache Ignite project and CTO of GridGain Systems, started in 2007. Nikita has led GridGain to develop advanced and distributed in-memory data processing technologies – the top Java in-memory data fabric starting every 10 seconds around the world today. Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. He is an active member of the Java middleware community and a contributor to the Java specification. He’s also a frequent international speaker with over two dozen talks at various developer conferences globally.

Resources

Apache Ignite Community Resources

apache/ignite on GitHub

Yardstick Apache Ignite Benchmarks

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite

Misys Uses GridGain to Enable High Performance, Real-Time Data Processing

The Spark Python API (PySpark)

Related Posts

Supporting the Fast Data Paradigm with Apache Spark. BY Stephen Dillon, Data Architect, Schneider Electric

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah. ODBMS Industry Watch, March 13, 2017

Follow ODBMS.org on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2017/06/on-apache-ignite-apache-spark-and-mysql-interview-with-nikita-ivanov/feed/ 0
Internet of Things: Safety, Security and Privacy. Interview with Vint G. Cerf http://www.odbms.org/blog/2017/06/internet-of-things-safety-security-and-privacy-interview-with-vint-g-cerf/ http://www.odbms.org/blog/2017/06/internet-of-things-safety-security-and-privacy-interview-with-vint-g-cerf/#comments Sun, 11 Jun 2017 17:06:03 +0000 http://www.odbms.org/blog/?p=4373

“I like the idea behind programmable, communicating devices and I believe there is great potential for useful applications. At the same time, I am extremely concerned about the safety, security and privacy of such devices.” –Vint G. Cerf

I had the pleasure to interview Vinton G. Cerf. Widely known as one of the “Fathers of the Internet,” Cerf is the co-designer of the TCP/IP protocols and the architecture of the Internet. Main topic of the interview is the Internet of Things (IoT) and its challenges, especially the safety, security and privacy of IoT devices.
Vint is currently Chief Internet Evangelist for Google.
RVZ

Q1. Do you like the Internet of Things (IoT)?

Vint Cerf: This question is far too general to answer. I like the idea behind programmable, communicating devices and I believe there is great potential for useful applications. At the same time, I am extremely concerned about the safety, security and privacy of such devices. Penetration and re-purposing of these devices can lead to denial of service attacks (botnets), invasion of privacy, harmful dysfunction, serious security breaches and many other hazards. Consequently the makers and users of such devices have a great deal to be concerned about.

Q2. Who is going to benefit most from the IoT?

Vint Cerf: The makers of the devices will benefit if they become broadly popular and perhaps even mandated to become part of a local ecosystem. Think “smart cities” for example. The users of the devices may benefit from their functionality, from the information they provide that can be analyzed and used for decision-making purposes, for example. But see Q1 for concerns.

Q3. One of the most important requirements for collections of IoT devices is that they guarantee physical safety and personal security. What are the challenges from a safety and privacy perspective that the pervasive introduction of sensors and devices poses? (e.g. at home, in cars, hospitals, wearables and ingestibles, etc.)

Vint Cerf: Access control and strong authentication of parties authorized to access device information or control planes will be a primary requirement. The devices must be configurable to resist unauthorized access and use. Putting physical limits on the behavior of programmable devices may be needed or at least advisable (e.g., cannot force the device to operate outside of physically limited parameters).
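
A small Python sketch can illustrate the two ideas in this answer: refusing commands from unauthorized parties, and refusing to operate outside fixed limits no matter what is requested. The thermostat model, its limits and the token check are hypothetical, and a software clamp is only an analogue of the physical limits Cerf mentions.

```python
from dataclasses import dataclass


@dataclass
class Thermostat:
    """Hypothetical device; the limits and the token set are illustrative only."""

    min_c: float = 5.0
    max_c: float = 30.0
    setpoint_c: float = 20.0
    authorized_tokens: frozenset = frozenset({"owner-token"})

    def set_temperature(self, token: str, requested_c: float) -> float:
        # Access control: reject callers that are not authorized.
        if token not in self.authorized_tokens:
            raise PermissionError("caller is not authorized to control this device")
        # Limit enforcement: clamp the request into the safe operating range.
        self.setpoint_c = min(self.max_c, max(self.min_c, requested_c))
        return self.setpoint_c
```

For example, an authorized request for 95 degrees is clamped to the 30 degree maximum, and any request carrying an unknown token raises `PermissionError`.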

Q5. Consumers want privacy. With IoT, physical objects in our everyday lives will increasingly detect and share observations about us. How is it possible to reconcile these two aspects?

Vint Cerf: This is going to be a tough challenge. Videocams that help manage traffic flow may also be used to monitor individuals or vehicles without their permission or knowledge, for example (cf: UK these days). In residential applications, one might want (insist on) the ability to disable the devices manually, for example. One would also want assurances that such disabling cannot be defeated remotely through the software.

Q6. Let’s talk more about security. It is reported that badly configured “smart devices” might provide a backdoor for hackers. What is your take on this?

Vint Cerf: It depends on how the devices are connected to the rest of the world. A particularly bad scenario would have a hacker taking over the operating system of 100,000 refrigerators. The refrigerator programming could be preserved but the hacker could add any of a variety of other functionality including DDOS capacity, virus/worm/Trojan horse propagation and so on.
One might want the ability to monitor and log the sources and sinks of traffic to/from such devices to expose hacked devices under remote control, for example. This is all a very real concern.
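
The monitoring idea in this answer can be sketched as a check of logged traffic against the destinations each device is expected to contact. This is a minimal Python illustration with a hypothetical log format, device names and host names; real monitoring would work from network flow records.

```python
from collections import defaultdict

# Hypothetical expectations: which hosts each device normally talks to.
EXPECTED = {
    "fridge-1": {"updates.vendor.example"},
    "cam-7": {"video.vendor.example"},
}


def suspicious_flows(flow_log, expected=EXPECTED):
    """Return, per device, the destinations that fall outside its expected set."""
    flagged = defaultdict(set)
    for device, destination in flow_log:
        if destination not in expected.get(device, set()):
            flagged[device].add(destination)
    return dict(flagged)
```

A refrigerator that suddenly sends traffic to an address outside its expected set would show up in the result, which is the kind of signal that could expose a device under remote control.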

Q7. What measures can be taken to ensure a more “secure” IoT?

Vint Cerf: Hardware to inhibit some kinds of hacking (e.g. through buffer overflows) can help. Digital signatures on bootstrap programs checked by hardware to inhibit boot-time attacks. Validation of software updates as to integrity and origin. Whitelisting of IP addresses and identifiers of end points that are allowed direct interaction with the device.
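
Two of these measures, validating updates and whitelisting end points, can be sketched in Python with the standard library. Note the assumptions: an HMAC with a shared secret stands in for a real public-key digital signature, and the key, networks and function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress

# Assumption: a shared secret stands in for a public-key signature scheme.
DEVICE_KEY = b"example-provisioned-secret"
# Assumption: an illustrative allowlist using a documentation address range.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def sign_update(image: bytes, key: bytes = DEVICE_KEY) -> str:
    """Vendor side: compute an authentication tag over the update image."""
    return hmac.new(key, image, hashlib.sha256).hexdigest()


def update_is_valid(image: bytes, tag: str, key: bytes = DEVICE_KEY) -> bool:
    """Device side: accept the image only if the tag matches (integrity and origin)."""
    return hmac.compare_digest(sign_update(image, key), tag)


def peer_is_allowed(address: str) -> bool:
    """Allow direct interaction only from allowlisted networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in ALLOWED_NETWORKS)
```

A production design would more likely verify an asymmetric signature so that devices hold no secret capable of signing updates; the sketch only shows where the checks sit.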

Q8. Is there a danger that IoT evolves into a possible enabling platform for cyber-criminals and/or for cyber war offenders?

Vint Cerf: There is no question this is already a problem. The DYN Corporation DDOS attack was launched by a botnet of webcams that were readily compromised because they had either no access controls or well-known usernames and passwords. This is the reason that companies must feel great responsibility and be provided with strong incentives to limit the potential for abuse of their products.

Q9. What are your personal recommendations for a research agenda and policy agenda based on advances in the Internet of Things?

Vint Cerf: Better hardware reinforcement of access control and use of the IOT computational assets. Better quality software development environments to expose vulnerabilities before they are released into the wild. Better software update regimes that reduce barriers to and facilitate regular bug fixing.

Q10. The IoT is still very much a work in progress. How do you see the IoT evolving in the near future?

Vint Cerf: Chaotic “standardization” with many incompatible products on the market. Many abuses by hackers. Many stories of bugs being exploited or serious damaging consequences of malfunctions. Many cases of “one device, one app” that will become unwieldy over time. Dramatic and positive cases of medical monitoring that prevents serious medical harms or signals imminent dangers. Many experiments with smart cities and widespread sensor systems.
Many applications of machine learning and artificial intelligence associated with IOT devices and the data they generate. Slow progress on common standards.

—————
Vinton G. Cerf co-designed the TCP/IP protocols and the architecture of the Internet and is Chief Internet Evangelist for Google. He is a member of the National Science Board and National Academy of Engineering and Foreign Member of the British Royal Society and Swedish Royal Academy of Engineering, and Fellow of ACM, IEEE, AAAS, and BCS.
Cerf received the US Presidential Medal of Freedom, US National Medal of Technology, Queen Elizabeth Prize for Engineering, Prince of Asturias Award, Japan Prize, ACM Turing Award, Legion d’Honneur and 29 honorary degrees.

Resources

European Commission, Internet of Things Privacy & Security Workshop’s Report, 10/04/2017

Securing the Internet of Things. US Homeland Security, November 16, 2016

Related Posts

Social and Ethical Behavior in the Internet of Things By Francine Berman, Vinton G. Cerf. Communications of the ACM, Vol. 60 No. 2, Pages 6-7, February 2017

Security in the Internet of Things, McKinsey & Company, May 2017

Interview to Vinton G. Cerf. ODBMS Industry Watch, July 27, 2009

Five Challenges to IoT Analytics Success. By Dr. Srinath Perera. ODBMS.org, September 23, 2016

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2017/06/internet-of-things-safety-security-and-privacy-interview-with-vint-g-cerf/feed/ 0
On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah http://www.odbms.org/blog/2017/03/on-the-new-developments-in-apache-spark-and-hadoop-interview-with-amr-awadallah/ http://www.odbms.org/blog/2017/03/on-the-new-developments-in-apache-spark-and-hadoop-interview-with-amr-awadallah/#comments Mon, 13 Mar 2017 10:54:21 +0000 http://www.odbms.org/blog/?p=4326

“What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).”–Amr Awadallah

I have interviewed Amr Awadallah, Chief Technology Officer at Cloudera.
Main topics of the interview are: the new developments in Apache Spark 2.0 Beta and the Hadoop 3.0.0-alpha1 release; the lessons learned from Amr’s experience of using Hadoop at Yahoo!; and the business problems that the world’s leading organisations have.

RVZ

Q1. Before Cloudera, you served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organisations to use Hadoop for data analysis and business intelligence. What are the main lessons you learned in that period?

Amr Awadallah: Couple of things. First, I learned that Hadoop is capable of solving all the business intelligence problems that I had at Yahoo.
Namely:
(1) our systems weren’t scaling fast enough (we needed to cut down transformation times from hours to minutes),
(2) our systems weren’t economical on a $/TB basis thus making it hard to retain valuable data for longer time periods, and
(3) we needed new methods to be able to store and analyze semi-structured (e.g. logs) and unstructured data (e.g. social media).
By implementing Hadoop in our team we saw first hand how it can address all these problems. The second lesson that I learned was that Hadoop, back then, was very rough to deploy and program against (it took us many months to deploy it and reprogram our transformations to run on it). It was these lessons that made it clear that there was room for a startup to focus on Hadoop since (1) it was solving very real data problems that many organizations would face, and (2) it needed a lot of polish to make it work smoothly, securely, and reliably within the enterprise.

Q2. In 2008 you founded Cloudera together with Mike Olson (Oracle), Jeff Hammerbacher (Facebook) and Christophe Bisciglia (Google). What was your main motivation at that time?

Amr Awadallah: Pretty much to do what I described above: we wanted to make the Hadoop technology easy to use for organizations. That included (1) creating a distribution for Hadoop that bundles all the necessary open-source projects that make it work (we call that CDH, short for Cloudera Distribution for Apache Hadoop), and (2) creating a number of proprietary system management, security, and meta-data management tools around CDH to make it easier for organizations to deploy and operate Hadoop in production.

Q3. What are the typical challenging business problems that the world’s leading organisations have?

Amr Awadallah: The technology we provide is very powerful and can be used to solve many problems across many industries, but we see four common themes: The first is simply using Hadoop as a faster, bigger, cheaper system for business intelligence and data analytics. i.e. a lot of organizations just use us to do things they have been doing already, just doing these things in a more economically scalable way.
The second use case is around deeper understanding of customers, i.e. moving away from segmenting all customers into a number of predefined buckets, but rather creating a dynamic micro-segment addressing each customer in a more precise way (thus reducing false positives).
The third use case is about using data to build better products and services, and this use case is catalyzed by the Internet of Things. Thanks to smart sensors we are able to measure the real world better than ever before, so this use case is about taking all that data and leveraging it to either enhance our current product/service offerings or build entirely new ones.
The fourth use case is about reducing business risk, and it manifests itself in a number of different sub-cases depending on the industry. For example, cyber-security is one of the key ways to reduce risk, and we have an open source project co-developed with Intel, called Apache Spot, which organizations can use to collect all their network flow data then use Spark machine learning algorithms to detect the anomalies in that data. Anti-money laundering and fraud detection is another way that our banking customers employ our platform to reduce risk within their businesses. Similarly, our insurance industry customers use our system to detect fraudulent claims, etc.

Q4. Can they be solved by analysing data? Can you give us some examples of how the use of advanced analytics drive business decisions?

Amr Awadallah: Yes, all the problems mentioned above can be solved with data. I want to highlight though that this isn’t necessarily about business decisions, which is what the Business Intelligence movement was about (we just help make that cheaper and faster). What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).
One of my favorite examples is a solution that one of our customers built to give voice to premature babies in neonatal intensive care units. They analyze the signals coming from the baby (sounds, blood pressure, heart rate, temperature, a few brain signals), and based on that a message appears on the monitor above the infant telling the nurse whether the baby is hungry, distressed from too much noise or light, etc.
That is really what we mean by using data to create new products and services that weren’t possible before (and not just reports/dashboards).

Q4. Graphs are important. Is it possible to do scalable graph analytics? If yes, how?

Amr Awadallah: Graphs are indeed important, a lot of our customer use-cases trace back to that (not just for social media analytics, but for example anti-money laundering requires analyzing relationships between many financial accounts for detecting bad behaviors, similarly for cyber security applications). I think scalability depends a fair bit on what’s being analyzed and how scalable we mean by scalable. But for most practical purposes I would say Spark’s GraphX is good enough. For example, you can compute PageRank fairly efficiently and scalably on a cluster using GraphX.
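To make the PageRank reference concrete, here is a minimal single-machine sketch of the underlying power iteration in plain Python. It is not GraphX code and says nothing about how GraphX distributes the work; the edge list, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the "teleport" share first.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical graph: accounts a, b, c form a cycle; d only sends to a.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

On a cluster, each iteration becomes a round of messages along the edges, which is the part a distributed graph engine parallelizes.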

Q5. Data security is increasingly important. The risk is due to the growing number of device endpoints. What solutions exist to minimise such risk?

Amr Awadallah: A comprehensive enterprise data security strategy seeks to mitigate the risks presented by a growing number of potentially compromised endpoints connecting to corporate networks. Endpoint security will enable one or all of the following preventative controls:
The first is policy-based enforcement of endpoint security configuration prior to granting an endpoint access to network-based corporate assets. This ensures that any endpoint connected to corporate networks meets minimum requirements for endpoint security configuration.
The second measure is endpoint based anti-malware software (the existence of which may be a policy requirement to connect to the network per the first measure). Anti-malware prevents malicious code from infecting endpoints by monitoring for changes to system configuration and unusual activity or processes.
The third measure is endpoint encryption of corporate data on hard drives, folders and even removable media.
As mentioned above we also collaborate with Intel on Apache Spot, which tracks network flow patterns to detect anomalous communication behavior between different devices (including end point devices). Apache Spot just recently won InfoWorld 2017 Tech of the Year Award. Other advanced analytics security partners we closely work with are: CounterTack, Securonix, Niara, and Jask.

Q6. You recently announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform. How does it work? And how does it differ from the Hadoop-based data platform?

Amr Awadallah: First, at a meta-level, Hadoop (MapReduce specifically) was very good at achieving scalable computation by spreading jobs across many CPU cores and hard disk spindles. That said, MapReduce wasn’t very efficient in how it leveraged memory to optimize the performance of data processing pipelines that have many stages or iterations.
The main power of Spark, that made it take over from MapReduce, was how it truly leveraged memory to achieve better performance in deep or iterative data pipelines. That coupled with a simpler developer API made Spark take over very quickly from MapReduce.
Most of our new customer implementations for data processing or data science tend to be in Spark these days, versus MapReduce.
I should clarify, however, that this doesn’t mean that Hadoop is dead, as some say. Apache Hadoop comprises three key subsystems: (1) MapReduce for computation, (2) YARN for resource scheduling, and (3) HDFS for storage. Spark only replaces MapReduce; we still rely heavily on both YARN and HDFS.

That said, the most notable features in Apache Spark 2.0 are:

1) Dataset API: It is a new API that represents the distributed collections of objects processed by Spark’s execution engine. It is an extension of Spark’s DataFrame API. It improves upon the DataFrame API by providing type-safe, object-oriented programming interfaces. Users can now write user-defined functions and lambda functions that provide compile-time type safety. With the Dataset API, users benefit from optimized operations (like sort, join, hash, etc.) in the Spark SQL engine, while also getting compile-time type safety for user-defined functions.

2) Model & Pipeline Persistence in Spark’s ML library: Machine learning Pipelines built with Spark’s ML library can now be serialized to a file and read back in.
The ability to save and reload these pipelines makes it easy for users to perform version control on the pipelines and safely distribute the pipelines. This helps in operationalizing them in production systems.

3) Structured Streaming: A new stream processing API and engine that provides SQL-like abstractions for authoring operations on data streams, and also improves performance by using the Spark SQL engine for processing the data streams. However, this is still an experimental API and not ready for production usage yet.

Besides the above 3 notable enhancements, there are a bunch of performance and scalability improvements across the board.

Q7. Apache Impala vs. Amazon Redshift: How Does Redshift Compare to Impala?

Amr Awadallah: Apache Impala is an analytic database engine architecturally designed to perform high-performance, highly concurrent SQL analytics on scalable, open data platforms like Hadoop’s HDFS and Amazon S3.
Impala decouples data storage from compute and lets users query data without having to move/load data specifically into an Impala storage-engine (it doesn’t have one). This architectural difference uniquely enables Impala to deliver a more flexible Business Intelligence experience than traditional database architectures like Redshift (which requires pre-loading the data).

Some of the key benefits of the Impala approach include:

* On-demand resources that are immediately ready to query existing S3 data without loading to a different data silo
* Ability to elastically grow/shrink clusters as needed due to decoupled storage and compute
* More predictable, multi-tenant isolation due to the ability to have multiple Impala clusters sharing a common S3 data repository
* Ability to share common data not only amongst Impala clusters, but also with any application that runs on cloud-native S3 storage. For example, you can have both Apache Impala and Apache Spark run against the same data asset in S3, whereas Apache Spark cannot easily access the data stored in Redshift because it has to go through SQL first.
* Greater flexibility to explore new use cases, analytics, and data by directly querying S3 without rigid traditional data models and ETL

Not only does Impala deliver this additional flexibility, it does so at greater cost-performance and scalability compared to Redshift. See the following benchmark for data on that.

That said, Redshift’s sweet spot is a different target: the smaller data mart. Most Redshift installations are in the dozen-node range, where Redshift’s limitations in scalability, elasticity, and flexibility, and its requirement to maintain separate copies of data, are less critical.

Q8. What is Apache Kudu, and why is it relevant for Impala Users?

Amr Awadallah: Historically we had two storage engines in our distribution: (1) HDFS which is optimized for high-throughput analytics, but doesn’t support updates/inserts and (2) HBase which is optimized for low-latency updates/inserts but isn’t good for doing high-throughput queries. To build a proper data warehouse or time-series analytics system, you typically still need to make updates/inserts and that was why we created Apache Kudu.

Kudu is a new storage system that combines the benefits of both HDFS and HBase into one: it allows for low-latency updates/inserts, but also supports high-throughput analytical queries (i.e. fast analytics on fast moving data).
Unlike HDFS, Kudu is not a file-system, it is a record-based system, so the unit of storage is a record as opposed to a file. This allows Kudu to unlock Impala for real-time streaming applications that were not possible with HDFS.
In HDFS the data would only be visible to Impala after we finish closing the file, which typically happens after a large number of records are accumulated (that adds latency between when records are written to when they become visible to the analytical engine). With Kudu as soon as a record is written it is immediately visible to the Impala analytical engine. Finally, just like HDFS and HBase, the Kudu storage engine is fully integrated with our entire stack, not just Impala.
For example, you can also use Apache Spark for machine-learning jobs directly against Kudu.

Q9. The Apache Hadoop project recently announced its 3.0.0-alpha1 release. What is it?

Amr Awadallah: HDFS erasure coding is really the main exciting new feature in Hadoop 3. Traditionally HDFS required three replicas, by default, for every data block to achieve durability, concurrent performance, and availability. Using erasure coding techniques, HDFS in Hadoop 3 allows us to significantly reduce the storage overhead from 3x (i.e. 200%) to just 20% extra bits for parity. This will allow us to achieve the same durability benefits as 3x replication, but it comes at the cost of potentially lower concurrent performance (when more than one job is trying to access the same block at the same time) and lower availability resilience in the face of top-of-rack switch failures (less of an issue these days).
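The storage arithmetic behind that comparison can be sketched in a few lines of Python. The exact overhead depends on the erasure coding layout chosen: with k data blocks and m parity blocks, the extra storage is m/k. The Reed-Solomon layouts (6, 3) and (10, 4) below are assumed examples of commonly cited schemes, and the (10, 2) layout is purely hypothetical, included only to show what a 20% figure would correspond to.

```python
def replication_overhead_pct(replicas):
    """Extra storage, in percent, for keeping `replicas` full copies of each block."""
    return (replicas - 1) * 100.0


def erasure_overhead_pct(data_blocks, parity_blocks):
    """Extra storage, in percent, for a layout of k data blocks plus m parity blocks."""
    return 100.0 * parity_blocks / data_blocks


# 3x replication stores two extra copies and survives the loss of two copies.
print("3x replication:", replication_overhead_pct(3), "% overhead")

# A (k, m) layout survives the loss of any m blocks out of the k + m in a group.
for k, m in [(6, 3), (10, 4), (10, 2)]:  # (10, 2) is a hypothetical layout
    print(f"RS({k},{m}):", erasure_overhead_pct(k, m), "% overhead, tolerates", m, "lost blocks")
```

The trade-off mentioned in the answer follows from the same picture: with replication, any one of three nodes can serve a block, while an erasure-coded block lives in one place and must be reconstructed from the others if that node is unavailable.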

Other cool additions are ATS v2 and classpath isolation which you can read more about here

Q10. What is the roadmap ahead for Cloudera Enterprise?

Amr Awadallah: We don’t discuss details of our product roadmap publicly, but there are three guiding themes for us in 2017: The first theme is fast-analytics on fast-moving data (which I covered above in regards to Kudu).
The second theme is cloud, which means making Cloudera Enterprise work better in cloud environments and making it easier to move workloads (and skill sets) from on-premises clusters to transient cloud clusters in AWS, Azure, and/or Google Cloud.
The third theme is simplifying data-science and machine learning development, especially reducing the time from when a new algorithm is developed to how it can be deployed into production (stay tuned for more on that front).
——————————
Amr Awadallah, Ph.D. Chief Technology Officer, Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds Bachelor’s and Master’s degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.

Resources

Download Page for Apache Spark™

Apache Impala supported by Cloudera Enterprise

DATA-X: Videobook- 8 short videos introduce query analytics for Apache Hadoop

A package that allows R developers to use Hadoop HBase

Book: Big Data Analytics with Spark

Related Posts

Streaming Analytics for Chain Monitoring. By Natalino Busa, Head of Data Science at Teradata. ODBMS.org, January 12, 2017

Five Challenges to IoT Analytics Success. By Dr. Srinath Perera. ODBMS.org, September 23, 2016

Next-Generation Genomics Analysis with Apache Spark. By Jason Bailey. ODBMS.org, June 30, 2016

Supporting the Fast Data Paradigm with Apache Spark. By Stephen Dillon, Data Architect, Schneider Electric. ODBMS.org, April 23, 2016

– The new series of Q&A with Leading Data Scientists– ODBMS.org:
Part II
Part I

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2017/03/on-the-new-developments-in-apache-spark-and-hadoop-interview-with-amr-awadallah/feed/ 0
Democratizing the use of massive data sets. Interview with Dave Thomas. http://www.odbms.org/blog/2016/09/democratizing-the-use-of-massive-data-sets-interview-with-dave-thomas/ http://www.odbms.org/blog/2016/09/democratizing-the-use-of-massive-data-sets-interview-with-dave-thomas/#comments Mon, 12 Sep 2016 19:04:14 +0000 http://www.odbms.org/blog/?p=4234

“Any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.”–Dave Thomas.

I have interviewed Dave Thomas, Chief Scientist at Kx Labs.

RVZ

Q1. For many years business users have had their data locked up in databases and data warehouses. What is wrong with that?

Dave Thomas: It isn’t so much an issue of where the data resides, whether it is in files, databases, data warehouses or a modern data lake. The challenge is that modern businesses need access to the raw data, as well as the ability to rapidly aggregate and analyze their data.

Q2. Typical business intelligence (BI) tool users have never seen their actual data. Why?

Dave Thomas: For large corporations hardware and software both used to be prohibitively expensive, hence much of their data was aggregated prior to making it available to users. Even today when machines are very inexpensive most corporate IT infrastructures are impoverished relative to what one can buy on the street or in the Cloud.
Compounding the problem, IT charge-back mechanisms are biased to reduce IT spending rather than to maximize the value of data delivered to the business.
Traditional technologies are not sufficiently performant to allow processing of large volumes of data.
Many companies have inexpensive data lakes and have realized after the fact that using commodity storage systems, such as HDFS, has severely constrained their performance and limited their utility. Hence more corporations are moving data away from HDFS into high-performance storage or memory.

Q3. What are the limitations of the existing BI and extract, transform and load (ETL) data tools?

Dave Thomas: Traditional BI tools assume that it is possible for DBAs and BI experts to a priori define the best way to structure and query the data. This reduces the whole power of BI to mere reporting. In an attempt to deal with huge BI backlogs, generic query and reporting tools have become popular to shift reporting to self-serve. However, they are often designed for sophisticated BI users rather than for normal business users. They are often not performant because they depend on the implementation of the underlying data stores.
For the most part, existing ETL tools are constrained by having to move the data to the ETL process and then on to the end user. Many ETL tools only work against one kind of data source. ETL can’t be written by normal users and due to the cost of an incorrect ETL run, such tools are not available to the data analyst. One of the major topics of discussion in Big Data shops is the complexity and performance of their Big Data pipeline. ETL, data blending, shouldn’t be a separate process or product. It should be something one can do with queries in a single efficient data language.

Q4. What are the typical technical challenges in finance, IoT and other time-series applications?

Dave Thomas:
1. Speed, as data volumes and variety are always increasing.
2. Ability to deal with both real-time events and historical events efficiently. Ideally in a single technology.
3. To handle time-series one needs to be able to deal with simultaneous arrival of events. Time with nanosecond precision is our solution. Other solutions are constrained by using milliseconds and event counters that are much less efficient.
4. High-performance operations on time, over days, months and years are essential for time-series. This is why time is a native type in Kx.
5. The essence of time-series is processing sliding time windows of data for both joins and aggregations.
6. In IoT, data is always dirty. Kx’s native support for missing data and out-of-band data due to failing sensors allows one to deal with the realities of sensor data.
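Several of the points above (nanosecond timestamps, simultaneous events, sliding windows, missing readings) can be illustrated with a small Python sketch. This is not q or kdb+ code; the function name, the integer nanosecond timestamps, and the use of None for a missing sensor reading are assumptions made for the example.

```python
from collections import deque


def sliding_window_sums(events, window_ns):
    """Running sum over a sliding time window.

    `events` is a list of (timestamp_ns, value) pairs sorted by timestamp.
    For each usable event, the result holds the sum of all values whose
    timestamps fall in the half-open interval (t - window_ns, t].
    """
    window = deque()
    total = 0
    results = []
    for ts, value in events:
        if value is None:
            # Missing reading (e.g. a failing sensor): skip it, do not treat it as 0.
            continue
        window.append((ts, value))
        total += value
        # Evict events that have fallen out of the window.
        while window[0][0] <= ts - window_ns:
            _, old_value = window.popleft()
            total -= old_value
        results.append((ts, total))
    return results


# Hypothetical sensor feed: two events share timestamp 1000 ns, one reading is missing.
events = [(0, 1), (1000, 2), (1000, 5), (1700, None), (2500, 4), (5000, 8)]
sums = sliding_window_sums(events, window_ns=2000)
print(sums)
```

Keeping the window in a deque means each event is added and removed once, so the cost per event stays constant regardless of window length.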

Q5. Kx offers analysts a language called q. Why not extend standard SQL?

Dave Thomas: I think there is a misunderstanding about q. Q is a full functional data language that both includes and extends SQL. Selects are easier than SQL because they provide implicit joins and group-bys. This makes queries roughly 50% of the code of SQL. Unlike many flavors of SQL, q lets one put a functional expression in any position in an SQL statement. One can easily extend the aggregation operations available to the end-user.

Q6. Can you show the difference between a query written in q and in standard SQL?

Dave Thomas: Here’s an example of retrieving parts from an orders table with a foreign key join to a parts table, summing quantity by color and then sorting by color:

q:
select sum qty by p.color from sp

SQL:
select p.color, sum(sp.qty) from sp, p
where sp.p=p.p group by p.color order by color
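For readers who want to run the SQL version, the following Python sketch executes it against an in-memory SQLite database. The table layouts and the sample rows are assumptions invented for illustration; only the SELECT statement comes from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: parts table p(p, color) and orders table sp(s, p, qty).
conn.execute("CREATE TABLE p (p TEXT PRIMARY KEY, color TEXT)")
conn.execute("CREATE TABLE sp (s TEXT, p TEXT REFERENCES p(p), qty INTEGER)")

# Made-up sample data.
conn.executemany(
    "INSERT INTO p VALUES (?, ?)",
    [("p1", "red"), ("p2", "green"), ("p3", "blue"), ("p4", "red")],
)
conn.executemany(
    "INSERT INTO sp VALUES (?, ?, ?)",
    [("s1", "p1", 300), ("s1", "p2", 200), ("s2", "p1", 300),
     ("s2", "p3", 400), ("s3", "p4", 100)],
)

# The SQL query from the interview, unchanged apart from line wrapping.
rows = conn.execute(
    "select p.color, sum(sp.qty) from sp, p "
    "where sp.p=p.p group by p.color order by color"
).fetchall()
print(rows)
conn.close()
```

The explicit join condition and the group by/order by clauses are the parts that the q version expresses implicitly through the foreign key and the `by` clause.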

Q7. How do queries execute inside the database?

Dave Thomas: Q is native to the database engine. Hence queries and analytics execute in the columns of the Kx database. There is no data shipping between the client and database server.

Q8. Shawn Rogers of Dell said: “A ‘citizen data scientist’ is an everyday, non-technical user that lacks the statistical and analytical prowess of a traditional data scientist, but is equally eager to leverage data in order to uncover insights, and importantly, do so at the speed of business.” What is your take on this?

Dave Thomas: High-performance data technologies, such as Kx, using modern large-memory hardware, can support queries from data analysts as well as data scientists. In the product Analyst for Kx, for example, users can work interactively on a sample of data using visual tools to import, clean, query, transform, analyze and visualize data with minimal, if any, programming or even SQL. Once the operations are correct on one or more samples, they can then be run against trillions of rows of data. Data analysts today can truly live in their data.

Q9. What are the risks of bringing the power of analytics to users who are non-expert programmers?

Dave Thomas: Clearly any important analysis needs to be validated and cross-checked. Hence any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.
In our experience users do make initial mistakes, but as they live in their data they quickly learn.
Visualization really helps, as does the provision of metadata about the data sources. Reducing the cycle time provides increased understanding, and allows one to make mistakes.
Runaway query performance has been a concern of DBAs, but for many years frameworks have been in place such as our smart query router that will ensure that ad hoc queries against massive datasets are throttled so they don’t run away. Fortunately, recent cost reductions in non-volatile memory make it possible to have high-performance query-only replicas of data that can be made available to different parts of the organization based on its needs.

Q10. How can non-expert programmers understand if the information expressed in visual analytics such as heat maps or in operational dashboard charts, is of good quality or not?

Dave Thomas: In our experience users spot visual anomalies much faster than inconsistencies in a spreadsheet.

Q11. What are the opportunities arising in “democratizing” the use of massive data sets?

Dave Thomas: We are finally living in a world where for many companies it is possible to run a real-time business where everyone can have fast, efficient access to the data they need. Rather than being held hostage to aggregations, spreadsheets and all sorts of variants of the truth, the organization can expediently see new opportunities to improve results in sales, marketing, production and other business operations.

Q12. How important is data query and data semantics?

Dave Thomas: Unfortunately we are not educated on how to express data semantics and data query.
Even computer scientists often study less about writing queries than how to execute them efficiently.
We need to educate students and employees on how to live in their data. It may well be that the future of programming for most will be writing queries. Given powerful data languages even compiler optimizations can be expressed by queries.
We need to invest much more in data governance and the use of standard terminology in order to share data within and across companies.

——————-
Dave Thomas, Kx Labs.
As Chief Scientist Dave envisions the future roadmap for Kx tools. Dave has had a long and storied career in computer software development and is perhaps best known as the founder and past CEO of Object Technology International, formerly OTI, now IBM OTI Labs, a pioneer in Agile Product Development. He was the principal visionary and architect for IBM VisualAge Smalltalk and Java tools and virtual machines including the popular open-source, multi-language Eclipse.org IDE. As the cofounder of Bedarra Research Labs he led the creation of the Ivy visual analytics workbench. Dave is a renowned speaker, university lecturer and Chairman of the Australian developer YOW! conferences.

Resources

New Kx release includes encryption, enhanced compression and Tableau integration. ODBMS.org, July 4, 2016.

Resources for learning more about kdb+ and q benchmarking results.

Kdb+ and the Internet of Things/Big Data. InDetail Paper by Bloor Research. Author: Philip Howard. ODBMS.org, January 28, 2015

Related Posts

Democratizing fast access to Big Data. By Dave Thomas, Chief Scientist at Kx Labs. ODBMS.org, April 26, 2016

On Data Governance. Interview with David Saul. ODBMS Industry Watch, Published on 2016-07-23

On the Challenges and Opportunities of IoT. Interview with Steve Graves. ODBMS Industry Watch, Published on 2016-07-06

On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, Published on 2016-05-24

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2016/09/democratizing-the-use-of-massive-data-sets-interview-with-dave-thomas/feed/ 0
LinkedIn China new Social Platform Chitu. Interview with Dong Bin. http://www.odbms.org/blog/2016/08/linkedin-china-new-social-platform-chitu-interview-with-dong-bin/ http://www.odbms.org/blog/2016/08/linkedin-china-new-social-platform-chitu-interview-with-dong-bin/#comments Thu, 04 Aug 2016 19:27:57 +0000 http://www.odbms.org/blog/?p=4181

“Complicated queries, like looking for second-degree friends, are really hard for traditional databases.” –Dong Bin

I have interviewed Dong Bin, Engineer Manager at LinkedIn China. The LinkedIn China development team launched a new social platform — known as Chitu — to attract a meaningful segment of the Chinese professional networking market.

RVZ

Q1. What is your role at LinkedIn China?

Dong Bin: I am an Engineer Manager in charge of the backend services for Chitu. The backend includes all of Chitu’s consumer-based features, like feeds, chat, events, etc.

Q2. You recently launched a new social platform, called Chitu. Which segment of the Chinese professional networking market are you addressing with Chitu? How many users do you currently have?

Dong Bin: Unlike LinkedIn.com, Chitu targets young people without a strong background, who mostly work in second-tier cities. They are eager to learn how to advance their careers. For business reasons, the member count cannot be published yet. Sorry for that.

Q3. What are the main similarities and differences of Chitu with respect to LinkedIn?

Dong Bin: Besides the difference in target users, Chitu includes more popular features such as Live Mode and knowledge monetization. The Chitu team also worked like a startup, which made the product move extremely fast. That is the key to beating the local competitors.

Q4. Who are your main competitors in China?

Dong Bin: The main competitors are: Maimai and Liepin.

Q5. What were the main challenges in developing Chitu?

Dong Bin: 1. At the beginning of the development, Chitu needed to be launched on an impossible deadline to catch up with competitors, by a team of fewer than 20 engineers. 2. Many popular features were proposed that are complicated from an implementation perspective, such as first-, second- and third-degree friends and real-time chat. They are tough problems for traditional infrastructure.

Q6. Why did you use a graph database for developing Chitu and not a conventional relational database?

Dong Bin: For development efficiency, I needed a schemaless database that can handle relationships very easily. A schema is a pain for fast iteration because it causes migrations in many environments. And complicated queries, like looking for second-degree friends, are really hard for traditional databases. Then I found that a graph database just fit my requirements.
I also found that graph databases perform well when querying connected data. With more than 10 years of experience using relational databases, I know that complicated joins are the performance killer. But graph databases kick ass compared with other databases.
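The second-degree friends query mentioned here can be sketched as a two-hop traversal over an adjacency structure. The Python below is an in-memory illustration only, not Neo4j or Cypher code; the graph contents and the function name are hypothetical. In a relational schema the same question typically needs a self-join on the friendship table, which is the kind of join the answer refers to.

```python
def second_degree_friends(graph, person):
    """Return friends-of-friends of `person`, excluding direct friends and the person."""
    first_degree = graph.get(person, set())
    reachable_in_two_hops = set()
    for friend in first_degree:
        # Follow each friendship one more hop.
        reachable_in_two_hops |= graph.get(friend, set())
    return reachable_in_two_hops - first_degree - {person}


# Hypothetical undirected friendship graph stored as adjacency sets.
graph = {
    "ann": {"bo", "cy"},
    "bo": {"ann", "dee"},
    "cy": {"ann", "dee", "eli"},
    "dee": {"bo", "cy"},
    "eli": {"cy"},
}
print(sorted(second_degree_friends(graph, "ann")))
```

Because each hop only touches the neighbours of the nodes already reached, the cost depends on the size of the local neighbourhood rather than on the total number of users.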

Q7. What main advantages did you experience in using Neo4j?

Dong Bin: 1. I decided to use a graph database, and I found that the No. 1 graph database is Neo4j, which left me no other choice; 2. Neo4j has native graph storage; 3. The community is active and the documentation is very rich, even comparable to that of MySQL or Oracle; 4. It is very fast.

Q8. Did you evaluate other graph databases in the market, other than Neo4j? If yes, which ones?

Dong Bin: Yes, I have evaluated OrientDB. I didn’t choose it because 1) it does not have native graph storage, which raised concerns about performance; 2) the community and the documentation are weak.

Q9. Can you be a bit more specific, and explain what you do with the Neo4j native graph storage, and why it is important for your application?

Dong Bin: Because native graph storage can handle queries with joins very quickly. Chitu has many queries that depend on that. I have experience with that.

Q10. When you say Neo4j is very fast, did you do any performance benchmarks? If yes, can you share the results? Did you do performance comparisons with other databases?

Dong Bin: We did have some rough benchmarks, but now we focus on production performance metrics. In the production logs, I can see that 99% of the queries take no more than 10 ms. This is the data I can provide with confidence.

Q11. What is the roadmap ahead for Chitu?

Dong Bin: The long-term goal is to become the No. 1 professional networking platform in China. Also, Chitu will focus on knowledge sharing and monetization.

———–
Dong Bin is an Engineer Manager at LinkedIn China. He has more than ten years of experience building web and database applications. His main interests are architectures for high performance and high stability. He has several years of database experience with MySQL, Redis and MongoDB, and fell in love with graph databases after learning about Neo4j. Prior to joining LinkedIn, he worked at Kabam as an Engineer Lead developing mobile strategy games. He obtained an M.S. from Harbin Institute of Technology in China.

Resources

Chitu: Chitu is a social network app created by LinkedIn China.

– Neo4j Graph Database Helps LinkedIn China Launch Separate Professional Social Networking App

– Graph Databases for Beginners: Native vs. Non-Native Graph Technology

Graph Databases. By Ian Robinson, Jim Webber, and Emil Eifrem. Published by O’Reilly Media, Inc. Second edition (224 pages).

Related Posts

– The Panama Papers: Why It Couldn’t Have Happened Ten Years Ago. By Emil Eifrem, CEO, Neo Technology. ODBMS.org, April 6, 2016

– Forrester Report: Graph Databases Market Overview. ODBMS.org, August 31, 2015

– Embracing the evolution of Graphs. By Stephen Dillon, Data Architect, Schneider Electric. ODBMS.org, January 2015

– Graph Databases for Beginners: Why Data Relationships Matter. By Bryce Merkl Sasaki. ODBMS.org, July 31, 2015

– Graph Databases for Beginners: The Basics of Data Modeling. By Bryce Merkl Sasaki. ODBMS.org, August 7, 2015

– Graph Databases for Beginners: Why a Database Query Language Matters. By Bryce Merkl Sasaki. ODBMS.org, August 21, 2015

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2016/08/linkedin-china-new-social-platform-chitu-interview-with-dong-bin/feed/ 2