“You can’t get good insights from bad data, and AI is playing an instrumental role in the data preparation renaissance.”–Narendra Mulani
I have interviewed Narendra Mulani, chief analytics officer, Accenture Analytics.
Q1. What is the role of Artificial Intelligence in analytics?
Narendra Mulani: Artificial Intelligence will be the single greatest change driver of our age. Combined with analytics, it’s redefining what’s possible by unlocking new value from data, changing the way we interact with each other and technology, and improving the way we make decisions. It’s giving us wider control and extending our capabilities as businesses and as people.
AI is also the connector and culmination of many elements of our analytics strategy including data, analytics techniques, platforms and differentiated industry skills.
You can’t get good insights from bad data, and AI is playing an instrumental role in the data preparation renaissance.
AI-powered analytics essentially frees talent to focus on insights rather than data preparation, which is more daunting given the sheer volume of data available. It helps organizations tap into new unstructured, contextual data sources like social, video and chat, giving clients a more complete view of their customer. Very recently we acquired Search Technologies, which possesses a unique set of technologies that give ‘context to content’ – whatever its format – and make it quickly accessible to our clients.
As a result, we gain more precise insights on the “why” behind transactions for our clients and can deliver better customer experiences that drive better business outcomes.
Overall, AI-powered analytics will go a long way in allowing the enterprise to find the trapped value that exists in data, discover new opportunities and operate with new agility.
Q2. How can enterprises become ‘data native’ and digital at the core to help them grow and succeed?
Narendra Mulani: It starts with embracing a new culture which we call ‘data native’. You can’t be digital to the core if you don’t embed data at the core. Getting there is no mean feat. The rate of change in technology and data science is exponential, while the rate at which humans can adapt to this change is finite. In order to close the gap, businesses need to democratize data and get new intelligence to the point where it is easily understood and adopted across the organization.
With the help of design-led analytics and app-based delivery, analytics becomes a universal language in the organization, helping employees make data-driven decisions, collaborate across teams and collectively focus efforts on driving improved outcomes for the business.
Enterprises today are only using a small fraction of the data available to them as we have moved from the era of big data to the era of all data. The comprehensive, real-time view businesses can gain of their operations from connected devices is staggering.
But businesses have to get a few things right to ensure they go on this journey.
Understanding and embracing convergence of analytics and artificial intelligence is one of them. You can hardly overstate the impact AI will have on mobilizing and augmenting the value in data, in 2018 and beyond. AI will be the single greatest change driver and will have a lasting effect on how business is conducted.
Enterprises also need to be ready to seize new opportunities – and that means using new data science to help shape hypotheses, test and optimize proofs-of-concept and scale quickly. This will help you reimagine your core business and uncover additional revenue streams and expansion opportunities.
All this requires a new level of agility. To help our clients act and respond fast, we support them with our platforms, our people and our partners. Backed by deep analytics expertise, new cloud-based systems and a curated and powerful alliance and delivery network, our priority is architecting the best solution to meet the needs of each client. We offer an as-a-service engagement model and a suite of intelligent industry solutions that enable even greater agility and speed to market.
Q3. Why is machine learning (ML) such a big deal, where is it driving changes today, and what are the big opportunities for it that have not yet been tapped?
Narendra Mulani: Machine learning allows computers to discover hidden or complex patterns in data without explicit programming. The impact this has on the business is tremendous—it accelerates and augments insights discovery, eliminates tedious repetitive tasks, and essentially enables better outcomes. It can be used to do a lot of good for people, from reading a car’s license plate and forcing the driver to slow down, to allowing people to communicate with others regardless of the language they speak, and helping doctors find very early evidence of cancer.
While the potential we’re seeing for ML and AI in general is vast, businesses are still in the infancy of tapping it. Organizations looking to put AI and ML to use today need to be pragmatic. While it can amplify the quality of insights in many areas, it also increases complexity for organizations, in terms of procuring specialized infrastructure or in identifying and preparing the data to train and use AI, and with validating the results. Identifying the real potential and the challenges involved are areas where most companies today lack the necessary experience and skills and need a trusted advisor or partner.
Whenever we look at the potential AI and ML have, we should also be looking at the responsibility that comes with it. Explainable AI and AI transparency are top of mind for many computer scientists, mathematicians and legal scholars.
These are critical subjects for an ethical application of AI – particularly critical in areas such as financial services, healthcare and life sciences – to ensure that data use is appropriate, and to assess the fairness of derived algorithms.
We need to recognize that, while AI is science, and science is limitless, there are always risks in how that science is used by humans, and proactively identify and address issues this might cause for people and society.
Narendra Mulani is Chief Analytics Officer of Accenture Analytics, a practice that his passion and foresight have helped shape since 2012.
A connector at the core, Narendra brings machine learning, data science, data engineers and the business closer together across industries and geographies to embed analytics and create new intelligence, democratize data and foster a data native culture.
He leads a global team of industry and function-specific analytics professionals, data scientists, data engineers, analytics strategy, design and visualization experts across 56 markets to help clients unlock trapped value and define new ways to disrupt in their markets. As a leader, he believes in creating an environment that is inspiring, exciting and innovative.
Narendra takes a thoughtful approach to developing unique analytics strategies and uncovering impactful outcomes. His insight has been shared with business and trade media including Bloomberg, Harvard Business Review, Information Management, CIO magazine, and CIO Insight. Under Narendra’s leadership, Accenture’s commitment and strong momentum in delivering innovative analytics services to clients was recognized in Everest Group’s Analytics Business Process Services PEAK Matrix™ Assessment in 2016.
Narendra joined Accenture in 1997. Prior to assuming his role as Chief Analytics Officer, he was the Managing Director – Products North America, responsible for delivering innovative solutions to clients across industries including consumer goods and services, pharmaceuticals, and automotive. He was also managing director of supply chain for Accenture Management Consulting where he led a global practice responsible for defining and implementing supply chain capabilities at a diverse set of Fortune 500 clients.
Narendra graduated with a Bachelor of Commerce degree at Bombay University, where he was introduced to statistics and discovered he understood probability at a fundamental level that propelled him on his destined career path. He went on to receive an MBA in Finance in 1982 as well as a PhD in 1985 focused on Multivariate Statistics, both from the University of Massachusetts. Education remains fundamentally important to him.
As one who logs too many frequent flier miles, Narendra is an active proponent of taking time for oneself to recharge and stay at the top of your game. He practices what he preaches through early rising and active mindfulness and meditation to keep his focus and balance at work and at home. Narendra is involved with various activities that support education and the arts, and is a music enthusiast. He lives in Connecticut with his wife Nita and two children, Ravi and Nikhil.
Follow us on Twitter: @odbmsorg
“Open source software comes with a promise, and that promise is not about looking at the code, rather it’s about avoiding vendor lock-in.” –Jacque Istok.
“The cloud has out-paced the data center by far and we should expect to see the entire database market being replatformed into the cloud within the next 5-10 years.” –Mike Waas.
I have interviewed Jacque Istok, Head of Data Technical Field for Pivotal, and Mike Waas, founder and CEO of Datometry.
The main topics of the interview are: the future of Data Warehousing, how open source and the Cloud are affecting the Data Warehouse market, and Datometry Hyper-Q and Pivotal Greenplum.
Q1. What is the future of Data Warehouses?
Jacque Istok: I believe that what we’re seeing in the market is a slight course correction with regard to the traditional data warehouse. For 25 years many of us spent many cycles building the traditional data warehouse: the single source of the truth.
But the long time it took to get alignment from each of the business units on how the data related to each other, combined with the cost of the hardware and software of the platforms we built it upon, left everybody looking for something new. Enter Hadoop, and suddenly the world found out that we could split up data on commodity servers and, with the right human talent, move the ball forward faster and cheaper. Unfortunately, the right human talent has proved hard to come by, and the plethora of projects that have sprung up are neither production ready nor completely compliant or compatible with the expensive tools they were trying to replace.
So what looks to be happening is the world is looking for the features of yesterday combined with the cost and flexibility of today. In many cases that will be a hybrid solution of many different projects/platforms/applications, or at the very least, something that can interface easily and efficiently with many different projects/platforms/applications.
Mike Waas: Indeed, flexibility is what most enterprises are looking for nowadays when it comes to data warehousing. The business needs to be able to tap data quickly and effectively. However, in today’s world we see an enormous access problem with application stacks that are tightly bonded with the underlying database infrastructure. Instead of maintaining large and carefully curated data silos, data warehousing in the next decade will be all about using analytical applications from a quickly evolving application ecosystem with any and all data sources in the enterprise: in short, any application on any database. I believe data warehouses remain the most valuable of databases, therefore, cracking the access problem there will be hugely important from an economic point of view.
Q2. How is open source affecting the Data Warehouse market?
Jacque Istok: The traditional data warehouse market is having its lunch eaten by open source, whether it’s one of the Hadoop distributions, one of the up-and-coming NoSQL engines, or companies like Pivotal making large bets on open source, production-proven alternatives like Greenplum. What I ask prospective customers is: if you were starting a new organization today, what platforms, databases, or languages would you choose that weren’t open source? The answer is almost always none. Open source software comes with a promise, and that promise is not about looking at the code, rather it’s about avoiding vendor lock-in.
Mike Waas: Whenever a technology stack gets disrupted by open source, it’s usually a sign that the technology has reached a certain maturity and customers have begun doubting the advantage of proprietary solutions. For the longest time, analytical processing was considered too advanced and too far-reaching in scope for an open source project. Greenplum Database is a great example for breaking through this ceiling: it’s the first open source database system with a query optimizer not only worth that title but setting a new standard, and a whole array of other goodies previously only available in proprietary systems.
Q3. Are databases an obstacle to adopting Cloud-Native Technology?
Jacque Istok: I believe quite the contrary, databases are a requirement for Cloud-Native Technology. Any applications that are created need to leverage data in some way. I think where the technology is going is to make it easier for developers to leverage whichever database or datastore makes the most sense for them or they have the most experience with – essentially leveraging the right tool for the right job, instead of the tool “blessed” by IT or Operations for general use. And they are doing this by automating the day 0, day 1, and day 2 operations of those databases. Making it easy to instantiate and use these platforms for anyone, which has never really been the case.
Mike Waas: In fact, a cloud-first strategy is incomplete unless it includes the data assets, i.e., the databases. Now, databases have always been one of the hardest things to move or replatform, and, naturally, it’s the ultimate challenge when moving to the cloud: firing up any new instance in the cloud is easy as 1-2-3 but what to do with the 10s of years of investment in application development? I would say it’s actually not the database that’s the obstacle but the applications and their dependencies.
Q4. What are the pros and cons of moving enterprise data to the cloud?
Jacque Istok: I think there are plenty of pros to moving enterprise data to the cloud; the extent of that list will really depend on the enterprise you’re talking to and the vertical that they are in. But cons? The only cons would be using these incredible tools incorrectly, at which point you might find yourself spending more money and feeling that things are slower or less flexible. Treating the cloud as a virtual data center, and simply moving things there without changing how they are architected or how they are used would be akin to taking
Mike Waas: I second that. A few years ago enterprises were still concerned about security, completeness of offering, and maturity of the stack. But now, the cloud has out-paced the data center by far and we should expect to see the entire database market being replatformed into the cloud within the next 5-10 years. This is going to be the biggest revolution in the database industry since the relational model with great opportunities for vendors and customers alike.
Q5. How do you quantify when is appropriate for an enterprise to move their data management to a new platform?
Jacque Istok: It’s pretty easy from my perspective: when any enterprise is done spending exorbitant amounts of money, it might be time to move to a new platform. When you are coming up on a renewal or an upgrade of a legacy and/or expensive system, it might be time to move to a new platform. When you have new initiatives to start, it might be time to move to a new platform. When you are ready to compete with your competitors, both known and unknown (aka startups), it might be time to move to a new platform. The move doesn’t have to be scary either, as some products are designed to be a bridge to a modern data platform.
Mike Waas: Traditionally, enterprises have held off from replatforming for too long: the switching cost has deterred them from adopting new and highly superior technology with the result that they have been unable to cut costs or gain true competitive advantage. Staying on an old platform is simply bad for business. Every organization needs to ask themselves constantly the question whether their business can benefit from adopting new technology. At Datometry, we make it easy for enterprises to move their analytics — so easy, in fact, the standard reaction to our technology is, “this is too good to be true.”
Q6. What is the biggest problem when enterprises want to move part or all of their data management to the cloud?
Jacque Istok: I think the biggest problem tends to be not architecting for the cloud itself, but instead treating the cloud like their virtual data center. Leveraging the same techniques, the same processes, and the same architectures will not lead to the cost or scalability efficiencies that you were hoping for.
Mike Waas: As Jacque points out, you really need to change your approach. However, the temptation is to use the move to the cloud as a trigger event to rework everything else at the same time. This quickly leads to projects that spiral out of control, run long, go over budget, or fail altogether. Being able to replatform quickly and separate the housekeeping from the actual move is, therefore, critical.
However, when it comes to databases, trouble runs deeper as applications and their dependencies on specific databases are the biggest obstacle. SQL code is embedded in thousands of applications and, probably most surprising, even third-party products that promise portability between databases get naturally contaminated with system-specific configuration and SQL extensions. We see roughly 90% of third-party systems (ETL, BI tools, and so forth) having been so customized to the underlying database that moving them to a different system requires substantial effort, time, and money.
Q7. How does an enterprise move the data management to a new platform without having to re-write all of the applications that rely on the database?
Mike Waas: At Datometry, we looked very carefully at this problem and, with what I said above, identified the need to rewrite applications each time new technology is adopted as the number one problem in the modern enterprise. Using Adaptive Data Virtualization (ADV) technology, this will quickly become a problem of the past! Systems like Datometry Hyper-Q let existing applications run natively and instantly on a new database without requiring any changes to the application. What would otherwise be a multi-year migration project and run into the millions, is now reduced in time, cost, and risk to a fraction of the conventional approach. “VMware for databases” is a great mental model that has worked really well for our customers.
Q8. What is Adaptive Data Virtualization technology, and how can it help adopting Cloud-Native Technology?
Mike Waas: Adaptive Data Virtualization is the simple, yet incredibly powerful, abstraction of a database: by intercepting the communication between application and database, ADV is able to translate in real-time and dynamically between the existing application and the new database. With ADV, we are drawing on decades of database research and solving what is essentially a compatibility problem between programming languages and systems with an elegant and highly effective approach. This is a space that has traditionally been served by consultants and manual migrations, which are incredibly labor-intensive and expensive undertakings.
Through ADV, adopting cloud technology becomes orders of magnitude simpler as it takes away the compatibility challenges that hamper any replatforming initiative.
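The interception-and-translation idea described above can be illustrated with a toy sketch. This is not Datometry's implementation: the legacy dialect (one in which `SEL` abbreviates `SELECT`), the single rewrite rule, and the names `translate` and `VirtualizedConnection` are assumptions made purely for illustration, with an in-memory SQLite database standing in for the new platform.

```python
import re
import sqlite3

# Hypothetical rewrite rules: (pattern in the legacy dialect, replacement in the target dialect).
REWRITE_RULES = [
    (re.compile(r"^\s*SEL\b", re.IGNORECASE), "SELECT"),
]


def translate(sql: str) -> str:
    """Rewrite a legacy-dialect statement into the target dialect."""
    for pattern, replacement in REWRITE_RULES:
        sql = pattern.sub(replacement, sql)
    return sql


class VirtualizedConnection:
    """Sits between the application and the new database and translates each statement."""

    def __init__(self, target: sqlite3.Connection):
        self._target = target

    def execute(self, sql: str, params=()):
        # The application's statement is translated on the fly, then forwarded.
        return self._target.execute(translate(sql), params)


def demo():
    target = sqlite3.connect(":memory:")  # stand-in for the new database
    target.execute("CREATE TABLE trades (id INTEGER, amount REAL)")
    target.executemany("INSERT INTO trades VALUES (?, ?)", [(1, 10.0), (2, 32.5)])
    conn = VirtualizedConnection(target)
    # The "application" still issues its legacy-dialect query unchanged.
    return conn.execute("SEL id, amount FROM trades ORDER BY id").fetchall()
```

Calling `demo()` runs the unchanged legacy query against the new database. A real system would also have to translate data types, error codes, stored procedures and wire protocols, which a one-rule regex sketch does not attempt.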
Q9. Can you quantify what are the reduced time, cost, and risk when virtualizing the data warehouse?
Jacque Istok: In the past, virtualizing the data warehouse meant sacrificing performance in order to get some of the common benefits of virtualization (reduced time for experimentation, maximizing resources, relative ease to readjust the architecture, etc). What we have found recently is that virtualization, when done correctly, actually provides no sacrifices in terms of performance, and the only question becomes whether or not the capital cost expenditure of bare metal versus the opex cost structure of virtual is something that makes sense for your organisation.
Mike Waas: I’d like to take it a step further and include ADV into this context too: instead of a 3-5 year migration, employing 100+ consultants, and rewriting millions of lines of application code, ADV lets you leverage new technology in weeks, with no re-writing of applications. Our customers can expect to save at least 85% of the transition cost.
Q10. What is the massively parallel processing (MPP) Scatter/Gather Streaming™ technology, and what is it useful for?
Jacque Istok: This is arguably one of the most powerful features of Pivotal Greenplum and it allows for the fastest loading of data in the industry. Effectively we scatter data into the Greenplum data cluster as fast as possible with no care in the world to where it will ultimately end up. Terabytes of data per hour, basically as much as you can feed down the wires, is sent to each of the workers within the cluster. The data is therefore disseminated to the cluster in the fastest physical way possible. At that point, each of the workers gathers the data that is pertinent to them according to the architecture you have chosen for the layout of those particular data elements, allowing for a physical optimization to be leveraged during interrogation of the data after it has been loaded.
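The two phases described above can be sketched in a few lines. This is only a structural illustration under stated assumptions, not Greenplum's actual loader: the worker count, the CRC32-based placement rule, and the function names `owner`, `scatter`, `gather` and `load` are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_WORKERS = 4  # hypothetical cluster size


def owner(key: str) -> int:
    """Worker that should finally hold a row, chosen by hashing its distribution key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS


def scatter(rows):
    """Phase 1: hand rows to workers round-robin, ignoring where they belong."""
    staged = defaultdict(list)
    for i, row in enumerate(rows):
        staged[i % NUM_WORKERS].append(row)
    return staged


def gather(staged, key_field):
    """Phase 2: each worker pulls the rows it owns from every staging area."""
    placed = defaultdict(list)
    for worker in range(NUM_WORKERS):
        for rows in staged.values():
            placed[worker].extend(r for r in rows if owner(r[key_field]) == worker)
    return placed


def load(rows, key_field):
    """Scatter first for raw ingest speed, then gather into the chosen layout."""
    return gather(scatter(rows), key_field)
```

For example, `load([{"customer": "c1", "amount": 5}, {"customer": "c2", "amount": 7}], "customer")` returns a mapping from worker number to the rows that hash to that worker, so all rows sharing a distribution key end up together.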
Q11. How do Datometry Hyper-Q and the Pivotal Greenplum data warehouse work together?
Jacque Istok: Pivotal Greenplum is the world’s only true open source, production proven MPP data platform that provides out of the box ANSI compliant SQL capabilities along with Machine Learning, AI, Graph, Text, and Spatial analytics all in one. When combined with Datometry Hyper-Q, you can transparently and seamlessly take any Teradata application and, without changing a single line of code or a single piece of SQL, run it and stop paying the outrageous Teradata tax that you have been bearing all this time. Once you’re able to take out your legacy and expensive Teradata system, without a long investment to rewrite anything, you’ll be able to leverage this software platform to really start to analyze the data you have. And that analysis can be either on premise or in the cloud, giving you a truly hybrid and cross-cloud proven platform.
Mike Waas: I’d like to share a use case for Datometry Hyper-Q and Pivotal Greenplum involving a Fortune 100 Global Financial Institution that needed to scale their business intelligence application, built using 2000-plus stored procedures. The customer’s analysis showed that replacing their existing data warehouse footprint was prohibitively expensive and rewriting the business applications to a more cost-effective and modern data warehouse posed significant expense and business risk. Hyper-Q allowed the customer to transfer the stored procedures in days without refactoring the logic of the application or implementing various control-flow primitives, which would have been a time-consuming and expensive proposition.
Qx. Anything else you wish to add?
Jacque Istok: Thank you for the opportunity to speak with you. We have found that there has never been a more valid time than right now for customers to stop paying their heavy Teradata tax, and the combination of Pivotal Greenplum and Datometry Hyper-Q allows them to do that right now, with no risk and immediate ROI. On top of that, they are then able to find themselves on a modern data platform – one that allows them to grow into more advanced features as they are able. Pivotal Greenplum becomes their bridge to transforming their organization by offering the advanced analytics they need while giving them traditional, production proven capabilities immediately. At the end of the day, there isn’t a single Teradata customer that I’ve spoken to that doesn’t want Teradata-like capabilities at Hadoop-like prices, and you get all this and more with Pivotal Greenplum.
Mike Waas: Thank you for this great opportunity to speak with you. We, at Datometry, believe that data is the key that will unlock competitive advantage for enterprises and without adopting modern data management technologies, it is not possible to unlock value. According to the leading industry group, TDWI, “today’s consensus says that the primary path to big data’s business value is through the use of so-called ‘advanced’ forms of analytics based on technologies for mining, predictions, statistics, and natural language processing (NLP). Each analytic technology has unique data requirements, and DWs must modernize to satisfy all of them.”
We believe virtualizing the data warehouse is the cornerstone of any cloud-first strategy because data warehouse migration is one of the most risk-laden and most expensive initiatives that a company can embark on during their journey to the cloud.
Interestingly, the cost of migration is primarily the cost of process and not technology and this is where Datometry comes in with its data warehouse virtualization technology.
We are the key that unlocks the power of new technology for enterprises to take advantage of the latest technology and gain competitive advantage.
Jacque Istok serves as the Head of Data Technical Field for Pivotal, responsible for setting both data strategy and execution of pre and post sales activities for data engineering and data science. Prior to that, he was Field CTO helping customers architect and understand how the entire Pivotal portfolio could be leveraged appropriately.
A hands on technologist, Mr. Istok has been implementing and advising customers in the architecture of big data applications and back end infrastructure the majority of his career.
Prior to Pivotal, Mr. Istok co-founded Professional Innovations, Inc. in 1999, a leading consulting services provider in the business intelligence, data warehousing, and enterprise performance management space, and served as its President and Chairman. Mr. Istok is on the board of several emerging startup companies and serves as their strategic technical advisor.
Mike Waas, CEO Datometry, Inc.
Mike Waas founded Datometry after having spent over 20 years in database research and commercial database development. Prior to Datometry, Mike was Sr. Director of Engineering at Pivotal, heading up Greenplum’s Advanced R&D team. He is also the founder and architect of Greenplum’s ORCA query optimizer initiative. Mike has held senior engineering positions at Microsoft, Amazon, Greenplum, EMC, and Pivotal, and was a researcher at Centrum voor Wiskunde en Informatica (CWI), Netherlands, and at Humboldt University, Berlin.
Mike received his M.S. in Computer Science from University of Passau, Germany, and his Ph.D. in Computer Science from the University of Amsterdam, Netherlands. He has authored or co-authored 36 publications on the science of databases and has 24 patents to his credit.
“With multiscale dataflow computing, we adjust the structure of the computer to the problem, rather than spending countless hours molding the problem into a computer language which is then interpreted by a microprocessor in an endless game of ‘Chinese whispers’. The poor microprocessor has no chance to figure out what the original problem might have been. We take a specific problem and program your computer to only solve that problem, or teach you to do it yourself. This means that the microprocessor does not waste energy, time and power on trying to figure out what needs to be computed next.”–Devin Graham.
I have interviewed Devin Graham, in charge of Finance Risk Products at Maxeler Technologies. We covered in the interview the challenges and opportunities for risk managers and how dataflow technology is transforming the industry.
Q1. What are the typical functions of a chief risk officer?
Devin Graham: To minimize risk across four categories: market risk, operational risk, credit risk and regulatory risk. For market risk you are trying to maximize the potential profit of your institution whilst ensuring you have the lowest amount of volatile risk. With operational risk you need to look at your business processes and ensure you have systems and controls in place that minimise any negative financial impacts to running your business. To manage credit risk you need to minimise the risk of the exposure of your assets and profits to your counterparties. Regulatory risk management involves ensuring the business is aware of and follows regulations.
Q2. What are the main challenges at present for financial risk management?
Devin Graham: The data sets you are dealing with now are very large. The challenge today is meeting the complexity and vastness of this data with speed – in real time. The velocity of data also poses challenges around security, particularly with threats of intrusion and spoofing attacks which are much harder to detect when there is so much data to analyse. Your computer needs to work out the patterns of serial spoofers, and CPUs with standard software stacks are overwhelmed by the challenge.
Q3. In which way can data flow technology be useful for risk management for the finance industry?
Devin Graham: Our dataflow technology provides complex calculations at maximum speed, for running analytics on large scale data sets and for line rate processing of trade flow and matching, as well as data enrichment. Multiscale Dataflow provides the technology to bridge over today’s financial capability gap, providing real and measurable competitive advantage.
Q4. What exactly is dataflow computing?
Devin Graham: With multiscale dataflow computing, we adjust the structure of the computer to the problem, rather than spending countless hours molding the problem into a computer language which is then interpreted by a microprocessor in an endless game of “Chinese whispers”.
The poor microprocessor has no chance to figure out what the original problem might have been. We take a specific problem and program your computer to only solve that problem, or teach you to do it yourself. This means that the microprocessor does not waste energy, time and power on trying to figure out what needs to be computed next.
In a financial context, multiscale dataflow makes it possible to analyse risk in real time, rather than off-line, looking at risk in the future, rather than computing the risk of the past.
Q5. What are the main differences between dataflow computation and computing with conventional CPUs?
Devin Graham: The main difference is that dataflow provides computational power at much lower energy consumption, much higher performance density and greater speed at tremendous savings in total cost of ownership. It is ideal for dealing with Big Complex Data.
More technically, CPUs solve equations linearly – through time. Dataflow computes vast numbers of equations as a graph, with data flowing through the nodes all at the same time. Complex calculations happen as a side effect of the data flowing through a graph which looks like the structure of your problem.
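To make the graph picture concrete, here is a small structural sketch. It is not Maxeler's programming model or API: the `Node` class, the `GRAPH` layout and the `stream` function are hypothetical, and the Python loop evaluates nodes one after another, whereas in dataflow hardware every node would be working at the same time on different data items. The graph computes (a + b) * (a - b) for each pair of values streamed in.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Node:
    """One operation in the graph; inputs name other nodes or external input streams."""
    op: Callable[..., float]
    inputs: Sequence[str]


# Graph for (a + b) * (a - b), listed in topological order.
GRAPH = {
    "sum": Node(lambda x, y: x + y, ("a", "b")),
    "diff": Node(lambda x, y: x - y, ("a", "b")),
    "out": Node(lambda x, y: x * y, ("sum", "diff")),
}


def stream(graph, inputs, output="out"):
    """Push one set of input values through the graph per tick and yield the output."""
    names = list(inputs)
    for values in zip(*(inputs[n] for n in names)):
        wires = dict(zip(names, values))  # values currently on the input wires
        for name, node in graph.items():
            wires[name] = node.op(*(wires[i] for i in node.inputs))
        yield wires[output]
```

For instance, `list(stream(GRAPH, {"a": [3, 5], "b": [1, 2]}))` yields `[8, 21]`. The point of the sketch is that the graph is laid out once to mirror the problem, and results then fall out as data flows through it.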
Q6. Do you have any measures to share with us on the benefits in performance, space and power consumption?
Devin Graham: Maxeler’s Dataflow technology enables organisations to speed up processing times by 20-50x when comparing computing boxes of the same size, with over 90% reduction in energy usage and over 95% reduction in data center space. Take one of our customers as an example: they were able to run the computations of 50 compute nodes in a single dataflow node. At that ratio, 32 Maxeler dataflow nodes are equivalent to 1,600 CPU nodes, delivering an operational cost saving of £3.2 million over 3 years.
In a financial risk context the advantages of Multiscale Dataflow Computing enable the analysis of thousands of market scenarios in minutes rather than hours. A Tier 1 investment bank recently delivered portfolio pricing and risk in seconds, down from minutes.
Q7. What is the new paradigm for financial risk management defined by Maxeler Technologies?
Devin Graham: The paradigm shift resulting from Maxeler’s technology gives traders and risk managers a superpower: real-time data analysis. The technology is available right here and right now, as opposed to other technologies which remain on the horizon, or require a datacenter to be cooled down to 0 Kelvin to compute a few bits of results.
Dataflow computing works at room temperature, without the need to cool things down to the point where even the smallest particles stop moving.
Since we describe Dataflow programs in Java, it is easy to learn how to program Dataflow Engines (DFEs). Financial analytics experts are learning how to program their DFEs themselves — putting power back into the hands of financial experts, without the need for help from external sources. That is very exciting!
Devin Graham, Senior Risk Advisor, Maxeler Technologies
Devin Graham, former partner and Chief Risk Officer at a multi-billion dollar hedge fund has spent his entire career in the financial services industry, managing risk, technology and businesses for large hedge funds and leading investment banks.
As Chief Risk Officer, Devin established and chaired the risk committee, was a member of the executive committee and investor relations management team. During his tenure, the fund achieved market leading returns with minimal return volatility.
Previously, Devin developed and managed multiple new technology driven businesses at a leading investment bank including Prime Brokerage, Derivative Investor Products, and Risk Analytics.
Devin received his B.S. in Biomechanical Engineering from MIT.
Follow us on Twitter: @odbmsorg
“To be competitive with non-open-source cloud deployment options, open source databases need to invest in “ease-of-use.” There is no tolerance for complexity in many development teams as we move to “ops-less” deployment models.” –Peter Zaitsev
I have interviewed Peter Zaitsev, Co-Founder and CEO of Percona.
In this interview, Peter talks about the Open Source Databases market; the Cloud; the scalability challenges at Facebook; compares MySQL, MariaDB, and MongoDB; and presents Percona’s contribution to the MySQL and MongoDB ecosystems.
Q1. What are the main technical challenges in obtaining application scaling?
Peter Zaitsev: When it comes to scaling, there are different types. There is a Facebook/Google/Alibaba/Amazon scale: these giants are pushing boundaries, and usually are solving very complicated engineering problems at a scale where solutions aren’t easy or known. This often means finding edge cases that break things like hardware, operating system kernels and the database. As such, these companies not only need to build very large-scale infrastructure with a high level of automation, but also ensure it is robust enough to handle these kinds of issues with limited user impact. A great deal of hardware and software deployment practices must be in place for such installations.
While these “extreme-scale” applications are very interesting and get a lot of publicity at tech events and in tech publications, this is a very small portion of all the scenarios out there. The vast majority of applications are running at the medium to high scale, where implementing best practices gets you the scalability you need.
When it comes to MySQL, perhaps the most important question is when you need to “shard.” Sharding — while used by every application at extreme scale — isn’t a simple “out-of-the-box” feature in MySQL. It often requires a lot of engineering effort to correctly implement it.
While sharding is sometimes required, you should really examine whether it is necessary for your application. A single MySQL instance can easily handle hundreds of thousands (or more) of moderately complicated queries per second, and terabytes of data. Pair that with Memcached or Redis caching, MySQL Replication or more advanced solutions such as Percona XtraDB Cluster or Amazon Aurora, and you can cover the transactional (operational) database needs for applications of a very significant scale.
Besides making such high-level architecture choices, you of course need to also ensure that you exercise basic database hygiene. Ensure that you’re using the correct hardware (or cloud instance type), the right MySQL and operating system version and configuration, have a well-designed schema and good indexes. You also want to ensure good capacity planning, so that when you want to take your system to the next scale and begin to thoroughly look at it you’re not caught by surprise.
Peter Zaitsev: The Facebook Team is the most qualified to answer this question. However, I imagine that at Facebook scale being efficient is very important because it helps to drive the costs down. If your hot data is in the cache, then what matters is that your database is efficient at handling writes, so you want a “write-optimized engine.”
If you use Flash storage, you also care about two things:
– A high level of compression, since Flash storage is much more expensive than spinning disk.
– You are also interested in writing as little to the storage as possible, as the more you write the faster it wears out (and needs to be replaced).
RocksDB and MyRocks are able to achieve all of these goals. As an LSM-based storage engine, writes (especially Inserts) are very fast — even for giant data sizes. They’re also much better suited for achieving high levels of compression than InnoDB.
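To make the "LSM-based" point concrete, here is a minimal sketch of a log-structured merge write path in Python. It is not how RocksDB or MyRocks are implemented: the class name `TinyLSM` and its parameters are hypothetical, and it leaves out the write-ahead log, compaction and compression. It shows only why inserts are cheap: writes land in an in-memory table, which is flushed to an immutable sorted run once it fills up, so nothing already on storage is rewritten in place.

```python
import bisect

class TinyLSM:
    """Toy LSM store: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}          # recent writes, mutable, in memory
        self.runs = []              # flushed sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A write only touches the memtable; no on-disk structure is updated.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as one sorted, immutable run (sequential I/O).
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Reads check the newest data first: memtable, then runs newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            idx = bisect.bisect_left(run, (key,))
            if idx < len(run) and run[idx][0] == key:
                return run[idx][1]
        return None

db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is flushed to a sorted run
db.put("a", 3)   # the newer value shadows the flushed one
```

The trade-off visible even in this toy is that reads may have to consult several runs, which is what compaction addresses in real engines.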
This Blog Post by Mark Callaghan has many interesting details, including this table which shows MyRocks having better performance, write amplification and compression for Facebook’s workload than InnoDB.
Q3. Beringei is Facebook’s open source, in-memory time series database. According to Facebook, large-scale monitoring systems cannot handle large-scale analysis in real time because the query performance is too slow. What is your take on this?
Peter Zaitsev: Facebook operates at extreme scale, so it is no surprise the conventional systems don’t scale well enough or aren’t efficient enough for Facebook’s needs.
I’m very excited Facebook has released Beringei as open source. Beringei itself is a relatively low-level storage engine that is hard to use for a majority of users, but I hope it gets integrated with other open source projects and provides a full-blown high-performance monitoring solution. Integrating it with Prometheus would be a great fit for solutions with extreme data ingestion rates and very high metric cardinality.
Q4. How do you see the market for open source databases evolving?
Peter Zaitsev: The last decade has seen a lot of open source database engines built, offering a lot of different data models, persistence options, high availability options, etc. Some of them were built as open source from scratch, while others were released as open source after years of being proprietary engines, with the most recent example being Comdb2 by Bloomberg. I think this heavy competition is great for pushing innovation forward, and is very exciting! For example, I think that if MongoDB hadn’t shown how many developers love a document-oriented data model, we might never have seen MySQL Document Store in the MySQL ecosystem.
With all this variety, I think there will be a lot of consolidation and only a small fraction of these new technologies really getting wide adoption. Many will either have niche deployments, or will be an idea breeding ground that gets incorporated into more popular database technologies.
I do not think SQL will “die” anytime soon, even though it is many decades old. But I also don’t think we will see it being the dominant “database” language, as it has been since the turn of the millennium.
The interesting disruptive force for open source technologies is the cloud. It will be very interesting for me to see how things evolve. With pay-for-use models of the cloud, the “free” (as in beer) part of open source does not apply in the same way. This reduces incentives to move to open source databases.
To be competitive with non-open-source cloud deployment options, open source databases need to invest in “ease-of-use.” There is no tolerance for complexity in many development teams as we move to “ops-less” deployment models.
Q5. In your opinion what are the pros and cons of MySQL vs. MariaDB?
Peter Zaitsev: While tracing its roots to MySQL, MariaDB is quickly becoming a very different database.
It implements some features MySQL doesn’t, but also leaves out others (MySQL Document Store and Group Replication) or implements them in a different way (JSON support and Replication GTIDs).
From the MySQL side, we have Oracle’s financial backing and engineering. You might dislike Oracle, but I think you agree they know a thing or two about database engineering. MySQL is also far more popular, and as such more battle-tested than MariaDB.
MySQL is developed by a single company (Oracle) and does not have as many external contributors compared to MariaDB — which has its own pluses and minuses.
MySQL is “open core,” meaning some components are available only in the proprietary version, such as Enterprise Authentication, Enterprise Scalability, and others. Alternatives for a number of these features are available in Percona Server for MySQL though (which is completely open source). MariaDB Server itself is completely open source, though there are other components that aren’t and that you might need to build a full solution, namely MaxScale.
Another thing MariaDB has going for it is that it is included in a number of Linux distributions. Many new users will be getting their first “MySQL” experience with MariaDB.
For additional insight into MariaDB, MySQL and Percona Server for MySQL, you can check out this recent article
Q6. What’s new in the MySQL and MongoDB ecosystem?
Peter Zaitsev: This could be its own and rather large article! With MySQL, we’re very excited to see what is coming in MySQL 8. There should be a lot of great changes in pretty much every area, ranging from the optimizer to retiring a lot of architectural debt (some of it 20 years old). MySQL Group Replication and MySQL InnoDB Cluster, while still early in their maturity, are very interesting products.
For MongoDB we’re very excited about MongoDB 3.4, which has been taking steps to be a more enterprise ready database with features like collation support and high-performance sharding. A number of these features are only available in the Enterprise version of MongoDB, such as external authentication, auditing and log redaction. This is where Percona Server for MongoDB 3.4 comes in handy, by providing open source alternatives for the most valuable Enterprise-only features.
Q7. Anything else you wish to add?
Peter Zaitsev: I would like to use this opportunity to highlight Percona’s contribution to the MySQL and MongoDB ecosystems by mentioning two of our open source products that I’m very excited about.
First, Percona XtraDB Cluster 5.7.
While this has been around for about a year, we just completed a major performance improvement effort that allowed us to increase performance up to 10x. I’m not talking about improving some very exotic workloads: these performance improvements are achieved in very typical high-concurrency environments!
I’m also very excited about our Percona Monitoring and Management product, which is unique in being the only fully packaged open source monitoring solution specifically built for MySQL and MongoDB. It is a newer product that has been available for less than a year, but we’re seeing great momentum in adoption in the community. We are focusing many of our resources on improving it and making it more effective.
Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With more than 150 professionals in 29 countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of Internet giants, large enterprises and many exciting startups. Percona was named to the Inc. 5000 in 2013, 2014, 2015 and 2016.
Peter was an early employee at MySQL AB, eventually leading the company’s High Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Data Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his recent ebook Practical MySQL Performance Optimization Volume 1 is one of percona.com’s most popular downloads.
“Gaia continues to be a challenging mission in all areas even after 4 years of operation.
In total we have processed almost 800 Billion (=800,000 Million) astrometric, 160 Billion (=160,000 Million) photometric and more than 15 Billion spectroscopic observations which is the largest astronomical dataset from a science space mission until the present day.”
— Uwe Lammers.
In December of 2013, the European Space Agency (ESA) launched a satellite called Gaia on a five-year mission to map the galaxy and learn about its past. The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”.
I recall here the Objectives of the Gaia Project (source ESA Web site):
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
I have been following the Gaia mission since 2011, and I have reported on it in three interviews so far.
In this interview, Uwe Lammers – Gaia’s Science Operations Manager – gives a very detailed description of the data challenges and the opportunities of the Gaia mission.
This interview is the fourth of the series, the second after the launch.
Q1. Of the raw astrometry, photometry and spectroscopy data collected so far by the Gaia spacecraft, what is their Volume, Velocity, Variety, Veracity and Value?
Since the beginning of the nominal mission in 2014 until the end of June 2017 the satellite has delivered about 47.5 TB of compressed raw data. This data is not suitable for any scientific analysis but first has to be processed into higher-level products, which inflates the volume about 4 times.
The average raw daily data rate is about 40 GB but is highly variable, depending on which part of the sky the satellite is currently scanning. The data is highly complex and interdependent but not unstructured: it does not come with a lot of meta-information as such but follows strictly defined structures. In general it is very trustworthy; however, the downstream data processing cannot blindly assume that every single observation is valid.
As with all scientific measurements, there can be outliers which must be identified and eliminated from the data stream as part of the analysis. Regarding value, Gaia’s data set is absolutely unique in a number of ways.
Gaia is the only mission surveying the complete sky with unprecedented precision and completeness. The end result is expected to be a treasure trove for generations of astronomers to come.
Q2. How is this data transmitted to Earth?
Under normal observing conditions the data is transmitted from the satellite to the ground through a so-called phased-array-antenna (PAA) at a rate of up to 8.5 Mbps. As the satellite spins, it continuously keeps a radio beam directed towards the Earth by activating successive panels on the PAA. This is a fully electronic process as there can be no moving parts on Gaia which would otherwise disturb the precise measurements. On the Earth we use three 35m radio dishes in Spain, Australia, and Argentina to receive the telemetry from Gaia.
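A back-of-the-envelope calculation puts the two figures quoted so far side by side: an average of about 40 GB per day and a peak downlink rate of 8.5 Mbps. The sketch below assumes decimal units (1 GB = 8,000 megabits), assumes that the 40 GB figure is what actually travels over the link, and ignores protocol overhead and the real ground-station contact schedule, so it is only an order-of-magnitude estimate.

```python
AVG_DAILY_VOLUME_GB = 40.0   # average raw daily volume quoted in the interview
MAX_DOWNLINK_MBPS = 8.5      # peak downlink rate quoted in the interview

def downlink_hours(volume_gb, rate_mbps):
    """Hours needed to transmit volume_gb at rate_mbps (decimal units assumed)."""
    megabits = volume_gb * 8 * 1000
    return megabits / rate_mbps / 3600

# Roughly ten and a half hours of transmission at the peak rate.
hours_per_day = downlink_hours(AVG_DAILY_VOLUME_GB, MAX_DOWNLINK_MBPS)
```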
Q3. Calibrated processed data, high level data products and raw data. What is the difference? What kind of technical data challenges do they each pose?
That question is not easy to answer in a few words. Raw data are essentially unprocessed digital measurements from the CCDs, perhaps comparable to data from the “raw mode” of digital consumer cameras. They have to be processed with a range of complex software to turn them into higher-level products from which, in the end, astrophysical information can be inferred. There are many technical challenges; the most basic one is still to handle the 100s of GBs of daily data. Handling means reception, storage, processing, I/O by the scientific algorithms, backing up, and disseminating the processed data to 5 other partner data processing centres across Europe.
Here at the Science Operations Centre (SOC) near Madrid we chose InterSystems Caché RDBMS + NetApp hardware as our storage solution years ago, and this continues to be a good solution. The system is reliable and performant, which are crucial prerequisites for us. Another technical challenge is data accountability, which means keeping track of the more than 70 million scientific observations we get from the satellite every single day.
Q4. Who are the users for such data and what they do with it?
The data we are generating here at the SOC has no immediate users. It is sent out to the 5 other Gaia Data Processing Centres where more scientific processing takes place and more higher-level products get created. From all this processed data we are constructing a stellar catalogue which is our final result, and this is what the end users – the astronomical community of the world – get to see. The first version of our catalogue was published 14 September last year (Gaia Data Release 1) and we are currently working hard to release the second version (DR2) in April next year.
Our end users do fundamental astronomical research with the data, ranging from looking at individual stars, studies of clusters and dynamics of our Milky Way to cosmological questions like the expansion rate of our universe. The scientific exploitation of the Gaia data has just started but already now more than 200 scientific articles have been published. This is about 1 per day since DR1 and we expect this rate to go up further after DR2.
Q5. Can you explain at a high level how is the ground processing of Gaia data implemented?
ESA has entrusted the Gaia data processing to the Data Processing and Analysis Consortium (DPAC), of which the SOC is an integral part. DPAC consists of 9 so-called Coordination Units (CU) and 6 data processing centres (DPCs) across Europe, so this is a large distributed system.
In total some 450 people from 20+ countries with a large range of educational backgrounds and experiences form DPAC. Roughly speaking, the CUs are responsible for writing and validating the scientific processing software which is then run in one of the DPCs (every CU is associated with exactly one DPC).
The different CUs cover different aspects of the data processing (e.g. CU3 takes care of astrometry, CU5 of photometry).
The corresponding processes run more or less independently of each other; however, due to the complex interdependencies of the Gaia data itself this is only a first approximation. Ultimately, everything depends on everything else (e.g. astrometry depends on photometry and vice versa) which means that the entire system has to be iterated to produce the final solution. As you can imagine a lot of data has to be exchanged. SOC/DPCE is the hub in a hub-and-spokes topology where the other 5 DPCs are sitting at the ends of the spokes. No data exchange between DPCs is allowed; instead, all the data flow is centrally managed through the hub at DPCE.
Q6. How do you process the data stream in near real-time in order to provide rapid alerts to facilitate ground-based follow-up?
Yes, indeed we do. For ground-based follow up observations of variable objects quick turn-around times are essential. The time difference between an observation made on-board and the confirmation of a photometric alert on the ground is typically 2 days now which is close to the optimal value given all the operational constraints we have.
Q7. What are the main technical challenges with respect to data processing, manipulation and storage you have encountered so far? And how did you solve them?
Regarding storage, the handling of 100s of GBs of raw and processed data every day has always been and remains until today quite a challenge as explained above. The Gaia data reduction task is also a formidable computational problem. Years ago we estimated the total numerical effort to produce the final catalogue at some 10^20 FLOPs and this has proven fairly accurate.
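To get a feel for what 10^20 floating-point operations means, the short calculation below converts that total into wall-clock time. The sustained rates are purely hypothetical round numbers, not figures from the Gaia data processing centres.

```python
TOTAL_OPERATIONS = 1e20                  # estimate quoted in the interview
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_of_compute(total_ops, sustained_ops_per_second):
    """Wall-clock years needed at a given sustained operation rate."""
    return total_ops / sustained_ops_per_second / SECONDS_PER_YEAR

# Hypothetical sustained rates, chosen only to illustrate the scale.
years_at_1_tflops = years_of_compute(TOTAL_OPERATIONS, 1e12)     # about 3.2 years
years_at_100_tflops = years_of_compute(TOTAL_OPERATIONS, 1e14)   # about 12 days
```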
So we need quite some number-crunching capabilities in the DPCs and to continuously expand CPU resources as the data volume keeps growing in the operational phase of the mission. Moore’s law is slowly coming to an end but, fortunately, a number of algorithms are perfectly parallelizable (processing every object in the sky individually and isolated) such that CPU bottlenecks can be ameliorated by simply adding more processors to the existing systems.
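The "perfectly parallelizable" case can be sketched as follows: when each object is processed individually and in isolation, the work is simply mapped over a pool of workers and adding workers adds throughput. The function `process_object` is a hypothetical stand-in for a real per-object algorithm. A thread pool is used here only to keep the sketch portable; CPU-bound work in CPython would normally use processes or separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

def process_object(measurements):
    """Hypothetical per-object reduction: the mean of that object's measurements."""
    return sum(measurements) / len(measurements)

def process_catalogue(objects, workers=4):
    """Process every object independently; no object depends on another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(process_object, objects))

catalogue = [[1.0, 3.0], [2.0, 2.0, 2.0], [10.0]]
reduced = process_catalogue(catalogue)
```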
Data transfers are likewise a challenge. At the moment 1 Gbps connections (public Internet) between DPCE and the other 5 DPCs are sufficient, however, in the coming years we heavily rely on seeing bandwidths increasing to 10 Gbps and beyond. Unfortunately, this is largely not under our control which is a risk to the project.
Q8. What kind of databases and analytics tools do you use for Gaia’s data pipeline?
As explained above, for the so-called daily pipeline we have chosen InterSystems Caché and are very satisfied with this approach. We had some initial problems with the system but were able to overcome all difficulties with the help of InterSystems. We very much appreciated their excellent service and customer orientation in that phase, and still do to the present day. Regarding analytics tools we use most facilities that are part of Caché, but have also developed a suite of custom-made solutions.
Q9. How do you transform the raw information into useful and reliable stellar positions?
The raw data from the satellite is first turned into higher-level products which already include preliminary estimates for the stellar positions. But each of these positions is then only based on a single measurement. The high accuracy of Gaia comes from combining _all_ observations that have been taken during the mission with a scheme called Astrometric Global Iterative Solution (AGIS) [see The astrometric core solution for the Gaia mission. Overview of models, algorithms, and software implementation].
This cannot be done on a star-by-star basis but is a global, simultaneous optimization of a large number of parameters including the 5 basic astrometric parameters of each star (about 1 Billion stars in total), the time-varying attitude of the satellite (a few Million parameters), and a number of calibration parameters (a few tens of thousands).
The process is iterative and in the end gives the best match between the model parameters and the actual observations. The stellar positions are two of the five astrometric parameters of each object.
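The flavour of such a global iterative scheme can be shown with a deliberately tiny toy model. This is not AGIS: the one-dimensional model (each observation equals a source position plus an attitude offset for its time bin), the function name and all the numbers are assumptions made for illustration, and calibration parameters and observation weights are left out. What it does share with the description above is the structure: the source block is solved with the attitude held fixed, then the attitude block with the sources held fixed, and the loop repeats until both agree with the observations.

```python
def solve_block_iterative(observations, n_sources, n_bins, iterations=50):
    """observations maps (source, time_bin) -> measured value,
    modelled as source_position + attitude_offset."""
    positions = [0.0] * n_sources
    attitude = [0.0] * n_bins
    for _ in range(iterations):
        # Source block: best position per source given the current attitude.
        for i in range(n_sources):
            resid = [v - attitude[b] for (s, b), v in observations.items() if s == i]
            positions[i] = sum(resid) / len(resid)
        # Attitude block: best offset per time bin given the current positions.
        for j in range(n_bins):
            resid = [v - positions[s] for (s, b), v in observations.items() if b == j]
            attitude[j] = sum(resid) / len(resid)
        # Remove the degeneracy (a constant can be traded between the blocks)
        # by forcing the attitude offsets to average to zero.
        mean_offset = sum(attitude) / n_bins
        attitude = [a - mean_offset for a in attitude]
    return positions, attitude

# Hypothetical noise-free test case; each source is seen only in some time bins.
true_positions = [10.0, 11.5, 9.25, 12.0, 8.5, 10.75]
true_attitude = [0.2, -0.1, 0.0, 0.1, -0.2]   # averages to zero
observations = {
    (i, j): true_positions[i] + true_attitude[j]
    for i in range(6) for j in range(5) if (i + j) % 3 != 0
}
positions, attitude = solve_block_iterative(observations, 6, 5)
```

No single source could be solved in isolation here, because every observation also carries an unknown attitude offset; the positions only become accurate once all observations are combined.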
Q10. What is the level of accuracy you have achieved so far?
The accuracies depend on the brightnesses of the stars – the brighter a star, the higher is the achievable accuracy. In DR1 the typical uncertainty is about 0.3 mas for the positions and parallaxes, and about 1 mas yr^-1 for the proper motions.
For positions and parallaxes a systematic component of another 0.3 mas should be added. With DR2 we are aiming to reduce these formal errors by at least a factor of 3 and likewise reduce systematic errors by the same or a larger amount.
Q11. The first catalogue of more than a billion stars from ESA’s Gaia satellite was published on 14 September 2016 – the largest all-sky survey of celestial objects to date. What data is in this catalog? What is the size and structure of the information you analysed so far?
Gaia DR1 contains astrometry, G-band photometry (brightnesses), and a modest number of variable star light curves, for a total of 1 142 679 769 sources [See Gaia Data Release 1. Summary of the astrometric, photometric, and survey properties]. For the large majority of those we only provide position and magnitude but about 2 Million stars also have parallaxes and proper motions. In DR2 these numbers will be substantially larger.
The information is structured in simple, easy-to-use tables which can be queried via the central Gaia Archive and a number of other data centres around the world.
Q12. What insights have been derived so far by analysing this data?
The astronomical community eagerly grabbed the DR1 data and since 14 September a couple of hundred scientific articles have appeared in peer-reviewed astronomical journals covering a large breadth of topics.
To give just one example: A new so-called open cluster of stars was discovered very close to the brightest star in the night sky, Sirius. All previous surveys had missed it!
Q13. How do you offer a proper infrastructure and middleware upon which scientists will be able to do exploration and modeling with this huge data set?
That is a very good question! At the moment the archive system does not yet allow real big-data mining using the entire large Gaia data set. Up to now we do not know precisely what scientists will want to do with the Gaia data in the end.
There is the “traditional” astronomical research which mostly uses only subsets of the data, e.g. all stars in a particular area of the sky. Such data requests can be satisfied with traditional queries to an RDBMS.
But in the future we expect also applications which will need data mining capabilities and we are experimenting with a number of different approaches using the “code-to-the-data” paradigm. The idea is that scientists will be able to upload and deploy their codes directly through a platform which allows execution with quick data access close to the archive.
For DR2 this will only be available for DPAC-internal use but, depending on experiences gained, as per DR3 it might become a service for public use. One technology we are looking at is Apache Spark for big data mining.
Q14. What software technologies do you use for accessing the Gaia catalogue and associated data?
As explained above, at the moment we are offering access to the catalogue only through a traditional RDBMS system which allows queries to be submitted in a special SQL dialect called ADQL (Astronomical Data Query Language). This DB system is not using InterSystems Caché but Postgres.
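For readers who have not seen ADQL, the sketch below builds a typical "all stars in a particular area of the sky" request as a query string. The table name `gaiadr1.gaia_source` and the column names are taken from the public Gaia Archive rather than from this interview, so treat them as assumptions, and the helper function is hypothetical. The code only constructs the text of the query and does not contact the archive.

```python
def cone_search_adql(ra_deg, dec_deg, radius_deg, limit=100):
    """Build an ADQL query for sources within radius_deg of a sky position.

    Table and column names are assumptions based on the public Gaia Archive.
    """
    return (
        f"SELECT TOP {int(limit)} source_id, ra, dec, phot_g_mean_mag "
        "FROM gaiadr1.gaia_source "
        "WHERE 1 = CONTAINS(POINT('ICRS', ra, dec), "
        f"CIRCLE('ICRS', {float(ra_deg)}, {float(dec_deg)}, {float(radius_deg)}))"
    )

# Half a degree around the approximate position of Sirius.
query = cone_search_adql(101.287, -16.716, 0.5)
```

Apart from the geometric functions (`POINT`, `CIRCLE`, `CONTAINS`), the statement reads like ordinary SQL, which is what lets a conventional relational database serve it.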
Q15. In addition to the query access, how do you “visualize” such data? Which “big data” techniques do you use for histograms production?
Visualization is done with a special custom-made application that sits close to the archive and is using not the raw data but pre-computed special objects especially constructed for fast visualization. We are not routinely using any big data techniques but are experimenting with a few key concepts.
For visualization one interesting novel application is called vaex and we are looking at it.
Histogramming of the entire data set is likewise done using pre-canned summary statistics which were generated when the data was ingested into the archive. The number of users really wanting the entire data set and this kind of functionality is very limited at the moment. We as well as the scientific community are still learning what can be done with the Gaia data set.
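The idea of pre-canned summary statistics can be sketched in a few lines: bin counts are updated once, as each record is ingested, so a later histogram request reads a handful of counters instead of scanning the whole data set. The class name, the bin width and the use of magnitude as the binned quantity are hypothetical choices for illustration, not a description of the Gaia Archive's implementation.

```python
import math
from collections import Counter

class IngestHistogram:
    """Maintain histogram bin counts incrementally at ingestion time."""

    def __init__(self, bin_width=1.0):
        self.bin_width = bin_width
        self.counts = Counter()

    def ingest(self, value):
        # One counter update per record; the raw value is not needed again.
        self.counts[math.floor(value / self.bin_width)] += 1

    def histogram(self):
        """Return {lower bin edge: count}, read from the precomputed counters."""
        return {b * self.bin_width: n for b, n in sorted(self.counts.items())}

summary = IngestHistogram(bin_width=1.0)
for magnitude in [12.3, 12.9, 15.1]:
    summary.ingest(magnitude)
precomputed = summary.histogram()
```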
Q16. Which “big data” software and hardware technologies did you use so far? And what are the lessons learned?
Again, we are only starting to look into big data technologies that may be useful for us. Until now most of the effort has gone into robustifying all systems and preparing DR1 and now DR2 for April next year. One issue is always that the Gaia data is so peculiar and special that COTS solutions rarely work. Most of the software systems we use are special developments.
Q17. What are the main technical challenges ahead?
As far as the daily systems are concerned we are now finally in the routine phase. The main future challenges lie in robustifying and validating the big outer iterative loop that I described above. It has not been tested yet, so we are executing it for the first time with real flight data.
Producing DR3 (mid to late 2020) will be a challenge as this for the first time involves output from all CUs and the results from the outer iterative loop. DR4 around the end of 2022 is then the final release for the nominal mission and for that we want to release “everything”. This means also the individual observation data (“epoch data”) which will inflate the total volume served by the archive by a factor of 100 or so.
Q18. Anything else you wish to add?
Gaia continues to be a challenging mission in all areas even after 4 years of operation. In total we have processed almost 800 Billion (=800,000 Million) astrometric, 160 Billion (=160,000 Million) photometric and more than 15 Billion spectroscopic observations which is the largest astronomical dataset from a science space mission until the present day.
Gaia is fulfilling its promises in every regard and the scientific community is eagerly looking into what is available already now and the coming data releases. This continues to be a great source of motivation for everybody working on this great mission.
Uwe Lammers. My academic background is in physics and computer science. After my PhD I joined ESA to first work on the X-ray missions EXOSAT, Beppo-SAX, and XMM-Newton before getting interested in Gaia in 2005. The first years I led the development of the so-called Astrometric Global Iterative Solution (AGIS) system and then became Gaia’s Science Operations Manager in 2014.
– The astrometric core solution for the Gaia mission. Overview of models, algorithms, and software implementation L. Lindegren, U. Lammers et al. Astronomy & Astrophysics, Volume 538, id.A78, 47 pp. February 2012, DOI: 10.1051/0004-6361/201117905
– Gaia Data Release 1. Summary of the astrometric, photometric, and survey properties A.G.A. Brown and Gaia Collaboration, Astronomy & Astrophysics, Volume 595, id.A2, 23 pp. November 2016, DOI: 10.1051/0004-6361/201629512
– Gaia Data Release 1. Astrometry: one billion positions, two million proper motions and parallaxes L. Lindegren, U. Lammers, et al. Astronomy & Astrophysics, Volume 595, id.A4, 32 pp. November 2016, DOI: 10.1051/0004-6361/201628714
– The Gaia mission in 2015. Interview with Uwe Lammers and Vik Nagjee, ODBMS Industry Watch, March 24, 2015
– The Gaia mission, one year later. Interview with William O’Mullane. ODBMS Industry Watch, January 16, 2013
– Objects in Space vs. Friends in Facebook. ODBMS Industry Watch, April 13, 2011
– Objects in Space. ODBMS Industry Watch, February 14, 2011
“Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.”–Nikita Ivanov.
I have interviewed Nikita Ivanov, CTO of GridGain.
Main topics of the interview are Apache Ignite, Apache Spark and MySQL, and how well they perform on big data analytics.
Q1. What are the main technical challenges of SaaS development projects?
Nikita Ivanov: SaaS requires that the applications be highly responsive, reliable and web-scale. SaaS development projects face many of the same challenges as software development projects including a need for stability, reliability, security, scalability, and speed. Speed is especially critical for modern businesses undergoing the digital transformation to deliver real-time services to their end users. These challenges are amplified for SaaS solutions which may have hundreds, thousands, or tens of thousands of concurrent users, far more than an on-premise deployment of enterprise software.
Fortunately, in-memory computing offers SaaS developers solutions to the challenges of speed, scale and reliability.
Q2. In your opinion, what are the limitations of MySQL® when it comes to big data analytics?
Nikita Ivanov: MySQL was originally designed as a single-node system and not with the modern data center concept in mind. MySQL installations cannot scale to accommodate big data using MySQL on a single node. Instead, MySQL must rely on sharding, or splitting a data set over multiple nodes or instances, to manage large data sets. However, most companies manually shard their database, making the creation and maintenance of their application much more complex. Manually creating an application that can then perform cross-node SQL queries on the sharded data multiplies the level of complexity and cost.
MySQL was also not designed to run complicated queries against massive data sets. The MySQL optimizer is quite limited, and each query executes on a single thread. A MySQL query can neither scale among multiple CPU cores in a single system nor execute as a distributed query across multiple nodes.
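To make the cost of manual sharding concrete, here is a minimal sketch of the routing and cross-shard query logic an application ends up owning. It is an illustration only: three in-memory SQLite databases stand in for MySQL instances, and the `orders` table, `shard_for`, `insert_order` and `total_revenue` are hypothetical names, not part of any product discussed here.

```python
import sqlite3
import zlib

# Assumption: three in-memory SQLite databases stand in for MySQL shards.
NUM_SHARDS = 3
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE orders (customer TEXT, amount REAL)")


def shard_for(customer: str) -> sqlite3.Connection:
    """Route a key to a shard by hashing it; the application owns this logic."""
    return shards[zlib.crc32(customer.encode("utf-8")) % NUM_SHARDS]


def insert_order(customer: str, amount: float) -> None:
    """Write a row to the single shard that owns this customer."""
    shard_for(customer).execute(
        "INSERT INTO orders VALUES (?, ?)", (customer, amount)
    )


def total_revenue() -> float:
    """Cross-shard query: send the SQL to every shard, then merge in the application."""
    partials = [
        db.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]
        for db in shards
    ]
    return sum(partials)


for name, amount in [("alice", 10.0), ("bob", 25.5), ("carol", 4.5), ("alice", 60.0)]:
    insert_order(name, amount)

print(total_revenue())  # 100.0
```

Even this simple aggregate needs scatter-and-merge code in the application. Joins, sorting and re-sharding when nodes are added all have to be handled the same way, which is where the complexity multiplies.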
Q3. What solutions exist to enhance MySQL’s capabilities for big data analytics?
Nikita Ivanov: Companies that require real-time analytics may attempt to manually shard their database. Tools such as Vitess, a framework YouTube released for MySQL sharding, or ProxySQL are often used to help implement sharding.
To speed up queries, caching solutions such as Memcached and Redis are often deployed.
Many companies turn to data warehousing technologies. These solutions require ETL processes and a separate technology stack which must be deployed and managed. There are many external solutions, such as Hadoop and Apache Spark, which are quite popular. Vertica and ClickHouse have also emerged as analytics solutions for MySQL.
Apache Ignite offers speed, scale and reliability because it was built from the ground up as a high-performance and highly scalable distributed in-memory computing platform.
In contrast to the MySQL single-node design, Apache Ignite automatically distributes data across nodes in a cluster, eliminating the need for manual sharding. The cluster can be deployed on-premise, in the cloud, or in a hybrid environment. Apache Ignite easily integrates with Hadoop and Spark, using in-memory technology to complement these technologies and achieve significantly better performance and scale. The Apache Ignite In-Memory SQL Grid is highly optimized and easily tuned to execute high-performance ANSI SQL-99 queries. The In-Memory SQL Grid offers access via JDBC/ODBC and the Ignite SQL API for external SQL commands or integration with analytics visualization software such as Tableau.
Q4. What exactly is Apache® Ignite™?
Nikita Ivanov: Apache Ignite is a high-performance, distributed in-memory platform for computing and transacting on large-scale data sets in real-time. It is 1,000x faster than systems built using traditional database technologies that are based on disk or flash technologies. It can also scale out to manage petabytes of data in memory.
Apache Ignite includes the following functionality:
· Data grid – An in-memory key value data cache that can be queried
· SQL grid – Provides the ability to interact with data in-memory using ANSI SQL-99 via JDBC or ODBC APIs
· Compute grid – A stateless grid that provides high-performance computation in memory using clusters of computers and massive parallel processing
· Service grid – A service grid in which grid service instances are deployed across the distributed data and compute grids
· Streaming analytics – The ability to consume an endless stream of information and process it in real-time
· Advanced clustering – The ability to automatically discover nodes, eliminating the need to restart the entire cluster when adding new nodes
Q5. How does Apache Ignite differ from other in-memory data platforms?
Nikita Ivanov: Most in-memory computing solutions fall into one of three types: in-memory data grids, in-memory databases, or streaming analytics engines.
Apache Ignite is a full-featured in-memory computing platform which includes an in-memory data grid, in-memory database capabilities, and a streaming analytics engine. Furthermore, Apache Ignite supports distributed ACID compliant transactions and ANSI SQL-99 including support for DML and DDL via JDBC/ODBC.
Q6. Can you use Apache® Ignite™ for Real-Time Processing of IoT-Generated Streaming Data?
Nikita Ivanov: Yes, Apache Ignite can ingest and analyze streaming data using its streaming analytics engine which is built on a high-performance and scalable distributed architecture. Because Apache Ignite natively integrates with Apache Spark, it is also possible to deploy Spark for machine learning at in-memory computing speeds.
Apache Ignite supports both high volume OLTP and OLAP use cases, supporting Hybrid Transactional Analytical Processing (HTAP) use cases, while achieving performance gains of 1000x or greater over systems which are built on disk-based databases.
Q7. How do you stream data to an Apache Ignite cluster from embedded devices?
Nikita Ivanov: It is very easy to stream data to an Apache Ignite cluster from embedded devices.
The Apache Ignite streaming functionality allows for processing never-ending streams of data from embedded devices in a scalable and fault-tolerant manner. Apache Ignite can handle millions of events per second on a moderately sized cluster for embedded devices generating massive amounts of data.
Q8. Is this different from using Apache Kafka?
Nikita Ivanov: Apache Kafka is a distributed streaming platform that lets you publish and subscribe to data streams. Kafka is most commonly used to build a real-time streaming data pipeline that reliably transfers data between applications. This is very different from Apache Ignite, which is designed to ingest, process, analyze and store streaming data.
Q9. How do you conduct real-time data processing on this stream using Apache Ignite?
Nikita Ivanov: Apache Ignite includes a connector for Apache Kafka so it is easy to connect Apache Kafka and Apache Ignite. Developers can either push data from Kafka directly into Ignite’s in-memory data cache or present the streaming data to Ignite’s streaming module where it can be analyzed and processed before being stored in memory.
This versatility makes the combination of Apache Kafka and Apache Ignite very powerful for real-time processing of streaming data.
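The two paths described above, storing events as they arrive or processing them before storing, can be sketched in a few lines. This is a conceptual illustration only and uses neither the Kafka nor the Ignite API: a list stands in for the incoming stream, dictionaries stand in for the in-memory cache, and the event fields and function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable

# Assumption: hypothetical sensor events as they might arrive from a stream.
events = [
    {"device": "t1", "temp_c": 20.5},
    {"device": "t2", "temp_c": 31.0},
    {"device": "t1", "temp_c": 21.5},
]


def store_raw(stream: Iterable[dict], cache: dict) -> None:
    """Path 1: push each event straight into the key-value store."""
    for offset, event in enumerate(stream):
        cache[(event["device"], offset)] = event


def store_processed(stream: Iterable[dict], cache: dict) -> None:
    """Path 2: process in flight (running mean per device) and store only the result."""
    counts: dict[str, int] = defaultdict(int)
    for event in stream:
        device = event["device"]
        counts[device] += 1
        previous = cache.get(device, 0.0)
        # Incremental mean update: mean += (x - mean) / n
        cache[device] = previous + (event["temp_c"] - previous) / counts[device]


raw_cache: dict = {}
mean_cache: dict = {}
store_raw(events, raw_cache)
store_processed(events, mean_cache)
print(mean_cache)
```

The first path keeps every event available for later queries, while the second trades detail for a much smaller memory footprint.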
Q10. Is this different from using Spark Streaming?
Nikita Ivanov: Spark Streaming enables processing of live data streams. This is merely one of the capabilities that Apache Ignite supports. Although Apache Spark and Apache Ignite utilize the power of in-memory computing, they address different use cases. Spark processes but doesn’t store data. It loads the data, processes it, then discards it. Ignite, on the other hand, can be used to process data and it also provides a distributed in-memory key-value store with ACID compliant transactions and SQL support.
Spark is also for non-transactional, read-only data while Ignite supports non-transactional and transactional workloads. Finally, Apache Ignite also supports purely computational payloads for HPC and MPP use cases while Spark works only on data-driven payloads.
Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.
Qx. Is there anything else you wish to add?
Nikita Ivanov: The world is undergoing a digital transformation which is driving companies to get closer to their customers. This transformation requires that companies move from big data to fast data, the ability to gain real-time insights from massive amounts of incoming data. Whether that data is generated by the Internet of Things (IoT), web-scale applications, or other streaming data sources, companies must put architectures in place to make sense of this river of data. As companies make this transition, they will be moving to memory-first architectures, which ingest and process data in memory before offloading to disk-based datastores, and will increasingly be applying machine learning and deep learning to understand the data. Apache Ignite continues to evolve in directions that will support and extend the abilities of memory-first architectures and machine learning/deep learning systems.
Nikita Ivanov, Founder & CTO, GridGain
Nikita Ivanov is founder of the Apache Ignite project and CTO of GridGain Systems, started in 2007. Nikita has led GridGain to develop advanced and distributed in-memory data processing technologies – the top Java in-memory data fabric starting every 10 seconds around the world today. Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, and contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. He is an active member of the Java middleware community and a contributor to the Java specification. He is also a frequent international speaker, with over two dozen talks at developer conferences globally.
“I like the idea behind programmable, communicating devices and I believe there is great potential for useful applications. At the same time, I am extremely concerned about the safety, security and privacy of such devices.” –Vint G. Cerf
I had the pleasure to interview Vinton G. Cerf. Widely known as one of the “Fathers of the Internet,” Cerf is the co-designer of the TCP/IP protocols and the architecture of the Internet. Main topic of the interview is the Internet of Things (IoT) and its challenges, especially the safety, security and privacy of IoT devices.
Vint is currently Chief Internet Evangelist for Google.
Q1. Do you like the Internet of Things (IoT)?
Vint Cerf: This question is far too general to answer. I like the idea behind programmable, communicating devices and I believe there is great potential for useful applications. At the same time, I am extremely concerned about the safety, security and privacy of such devices. Penetration and re-purposing of these devices can lead to denial of service attacks (botnets), invasion of privacy, harmful dysfunction, serious security breaches and many other hazards. Consequently the makers and users of such devices have a great deal to be concerned about.
Q2. Who is going to benefit most from the IoT?
Vint Cerf: The makers of the devices will benefit if they become broadly popular and perhaps even mandated to become part of a local ecosystem. Think “smart cities” for example. The users of the devices may benefit from their functionality, from the information they provide that can be analyzed and used for decision-making purposes, for example. But see Q1 for concerns.
Q3. One of the most important requirements for collections of IoT devices is that they guarantee physical safety and personal security. What are the challenges from a safety and privacy perspective that the pervasive introduction of sensors and devices poses? (e.g. at home, in cars, hospitals, wearables and ingestibles, etc.)
Vint Cerf: Access control and strong authentication of parties authorized to access device information or control planes will be a primary requirement. The devices must be configurable to resist unauthorized access and use. Putting physical limits on the behavior of programmable devices may be needed or at least advisable (e.g., cannot force the device to operate outside of physically limited parameters).
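The idea of limiting behavior to a safe operating range can be illustrated with a software analogue. The sketch below is an assumption-laden example, not a description of any real device: the thermostat, its limits and the names `Limits` and `apply_setpoint` are hypothetical, and a true physical limit would be enforced in hardware rather than in code that an attacker might replace.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Safe operating range; frozen so fields cannot be reassigned at run time."""
    low: float
    high: float


# Assumption: hypothetical safe range for a thermostat, in degrees Celsius.
THERMOSTAT_LIMITS = Limits(low=5.0, high=30.0)


def apply_setpoint(requested: float, limits: Limits = THERMOSTAT_LIMITS) -> float:
    """Clamp any requested setpoint into the safe range, whoever sent the request."""
    if math.isnan(requested):
        raise ValueError("setpoint must be a number")
    return min(max(requested, limits.low), limits.high)


print(apply_setpoint(95.0))  # 30.0: the out-of-range request is clamped
```

The point is that the limit is applied regardless of who is authenticated, so a compromised account still cannot drive the device outside its safe range.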
Q5. Consumers want privacy. With IoT physical objects in our everyday lives will increasingly detect and share observations about us. How is it possible to reconcile these two aspects?
Vint Cerf: This is going to be a tough challenge. Videocams that help manage traffic flow may also be used to monitor individuals or vehicles without their permission or knowledge, for example (cf: UK these days). In residential applications, one might want (insist on) the ability to disable the devices manually, for example. One would also want assurances that such disabling cannot be defeated remotely through the software.
Q6. Let’s talk more about security. It is reported that badly configured “smart devices” might provide a backdoor for hackers. What is your take on this?
Vint Cerf: It depends on how the devices are connected to the rest of the world. A particularly bad scenario would have a hacker taking over the operating system of 100,000 refrigerators. The refrigerator programming could be preserved but the hacker could add any of a variety of other functionality including DDOS capacity, virus/worm/Trojan horse propagation and so on.
One might want the ability to monitor and log the sources and sinks of traffic to/from such devices to expose hacked devices under remote control, for example. This is all a very real concern.
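A very small sketch of that kind of monitoring is shown below. It assumes a flow log of (device id, device class, destination) tuples and a per-class list of expected destinations. The log format, the host names and the function `suspicious_flows` are all hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

# Assumption: destinations each device class is expected to talk to.
EXPECTED = {"fridge": {"updates.example.com", "telemetry.example.com"}}

# Assumption: flow log entries are (device_id, device_class, destination).
log = [
    ("fridge-01", "fridge", "telemetry.example.com"),
    ("fridge-02", "fridge", "203.0.113.7"),
    ("fridge-02", "fridge", "203.0.113.7"),
    ("fridge-03", "fridge", "updates.example.com"),
]


def suspicious_flows(entries) -> Counter:
    """Count, per device, the flows to destinations outside the expected set."""
    flagged: Counter = Counter()
    for device_id, device_class, destination in entries:
        if destination not in EXPECTED.get(device_class, set()):
            flagged[device_id] += 1
    return flagged


print(dict(suspicious_flows(log)))
```

A device that repeatedly contacts unexpected destinations is a candidate for closer inspection. A device of an unknown class is flagged on every flow, which is the conservative default.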
Q7. What measures can be taken to ensure a more “secure” IoT?
Vint Cerf: Hardware to inhibit some kinds of hacking (e.g. through buffer overflows) can help. Digital signatures on bootstrap programs checked by hardware to inhibit boot-time attacks. Validation of software updates as to integrity and origin. Whitelisting of IP addresses and identifiers of end points that are allowed direct interaction with the device.
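Two of these measures, validating updates and whitelisting end points, can be sketched with the Python standard library. The example makes simplifying assumptions that are worth stating: it uses an HMAC with a shared secret as a stand-in for a digital signature (a real device would more likely verify an asymmetric signature against a vendor public key), and the key, the network range and the function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress

# Assumption: a shared secret provisioned at manufacture. HMAC is symmetric,
# so this is a stand-in for a true digital signature, not an equivalent.
DEVICE_KEY = b"hypothetical-provisioned-secret"

# Assumption: only peers in this documentation-range network may connect.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def sign_update(image: bytes, key: bytes = DEVICE_KEY) -> str:
    """Vendor side: compute an authentication tag over the update image."""
    return hmac.new(key, image, hashlib.sha256).hexdigest()


def update_is_valid(image: bytes, tag: str, key: bytes = DEVICE_KEY) -> bool:
    """Device side: the tag matches only if the image is unmodified and was
    tagged by a holder of the key (integrity and origin)."""
    return hmac.compare_digest(sign_update(image, key), tag)


def peer_is_allowed(address: str) -> bool:
    """Whitelist check: accept only peers inside an allowed network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in ALLOWED_NETWORKS)


image = b"firmware-v2"
tag = sign_update(image)
print(update_is_valid(image, tag), update_is_valid(image + b"!", tag))
print(peer_is_allowed("192.0.2.10"), peer_is_allowed("198.51.100.1"))
```

Using `hmac.compare_digest` rather than `==` avoids leaking information through comparison timing.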
Q8. Is there a danger that IoT evolves into a possible enabling platform for cyber-criminals and/or for cyber war offenders?
Vint Cerf: There is no question this is already a problem. The DYN Corporation DDOS attack was launched by a botnet of webcams that were readily compromised because they either had no access controls or used well-known usernames and passwords. This is the reason that companies must feel great responsibility and be provided with strong incentives to limit the potential for abuse of their products.
Q9. What are your personal recommendations for a research agenda and policy agenda based on advances in the Internet of Things?
Vint Cerf: Better hardware reinforcement of access control and use of the IOT computational assets. Better quality software development environments to expose vulnerabilities before they are released into the wild. Better software update regimes that reduce barriers to and facilitate regular bug fixing.
Q10. The IoT is still very much a work in progress. How do you see the IoT evolving in the near future?
Vint Cerf: Chaotic “standardization” with many incompatible products on the market. Many abuses by hackers. Many stories of bugs being exploited or serious damaging consequences of malfunctions. Many cases of “one device, one app” that will become unwieldy over time. Dramatic and positive cases of medical monitoring that prevents serious medical harms or signals imminent dangers. Many experiments with smart cities and widespread sensor systems.
Many applications of machine learning and artificial intelligence associated with IOT devices and the data they generate. Slow progress on common standards.
Vinton G. Cerf co-designed the TCP/IP protocols and the architecture of the Internet and is Chief Internet Evangelist for Google. He is a member of the National Science Board and National Academy of Engineering and Foreign Member of the British Royal Society and Swedish Royal Academy of Engineering, and Fellow of ACM, IEEE, AAAS, and BCS.
Cerf received the US Presidential Medal of Freedom, US National Medal of Technology, Queen Elizabeth Prize for Engineering, Prince of Asturias Award, Japan Prize, ACM Turing Award, Legion d’Honneur and 29 honorary degrees.
“Looking across the senior leadership in Government, very few top Civil Servants and Ministers have come from a technical background. Most Departments will have a CTO/CIO person who may or may not also have been drawn from a relevant technical background. Those that are drawn from such a background and are empowered by their senior leadership, deliver a clear advantage to their organisation.”–Sarbjit Singh Bakhshi.
I have interviewed Sarbjit Singh Bakhshi, Director of Government Affairs at Maxeler Technologies. We covered in the interview the challenges and opportunities for the UK public sector in the Post-Brexit Era.
Q1. It has been suggested that some of the key challenges in government IT are: i) change aversion, ii) lack of technocratic leadership, and iii) processes that don’t scale down. What is your take on this?
Sarbjit Singh Bakhshi: Looking across the senior leadership in Government, very few top Civil Servants and Ministers have come from a technical background. Most Departments will have a CTO/CIO person who may or may not also have been drawn from a relevant technical background. Those that are drawn from such a background and are empowered by their senior leadership, deliver a clear advantage to their organisation.
Where these people are not empowered, you will often find bad technical choices made by Departments that seem to be driven by short term commercial gain rather than long term interests. In the worst cases, they’d rather patch up systems knowing they will fail in the medium term than invest in a long term solution. This leads to rather inevitable problems when they can no longer pursue this strategy.
There are also issues in terms of understanding the total cost of computing. The cost of inefficient systems that consume vast amounts of electricity is often hidden from decision makers, as running costs come out of an operational budget for the Department. So the decision makers will focus on a low purchase price for some systems even if the systems cost more in the long run to operate.
There needs to be a greater understanding of the challenges of running Government IT and there should be changes in leadership to support this. A focus on total cost of ownership (including transition costs) needs to be taken into account.
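A small worked example shows how a low purchase price can hide a higher total cost. All figures below are hypothetical and chosen only to illustrate the arithmetic. The function `total_cost_of_ownership` is an illustrative helper, not an official costing method.

```python
def total_cost_of_ownership(
    purchase: float, annual_running: float, years: int, transition: float = 0.0
) -> float:
    """Purchase price plus transition cost plus running costs over the period."""
    return purchase + transition + annual_running * years


# Assumption: hypothetical figures over a five-year period.
cheap_to_buy = total_cost_of_ownership(
    purchase=100_000, annual_running=60_000, years=5
)
efficient = total_cost_of_ownership(
    purchase=180_000, annual_running=25_000, years=5, transition=20_000
)

print(cheap_to_buy, efficient)  # 400000 325000
```

In this example the system with the higher purchase price and a transition cost is still cheaper over five years, which a decision based on the capital budget alone would miss.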
Q2. What are the main barriers to use new and innovative technologies in the UK Government?
Sarbjit Singh Bakhshi: Like most advanced countries that have been running big IT programmes since the 1950s, the UK Government has a very heterogeneous IT estate, and compatibility with older and archaic systems can still be a problem. The threat of cyber warfare also needs to be considered when delivering new systems into this environment.
There are also issues around procurement which can stifle innovative and iterative approaches to technology deployment.
The recent move to create the G-Cloud does help in this respect, but we still see too many OJEU procedures which often crowd out smaller companies from the opportunities of working with Government.
Q3. How will Brexit affect the UK’s tech industry?
Sarbjit Singh Bakhshi: As Article 50 has just been invoked and nothing has been agreed it is a little too soon to tell.
Obviously, there are concerns in the UK’s tech industry about recruiting the best staff from around the world to work in the UK, access to the Digital Single Market, and any agreements the EU has on the safe storage of data. I’m sure these will all be top of mind for British negotiators.
Q4. What are the challenges and opportunities for the UK public sector in the Post-Brexit Era?
Sarbjit Singh Bakhshi: The challenges are manifold. The applicability of EU laws, the future of EU citizens in the UK, and how both affect the administration of the country are probably paramount.
There are also opportunities for the British tech industry in the creation of parallel Governmental systems to replace the ones we are currently using that are from the EU.
Q5. What were your main motivations to leave the Government and go to Industry?
Sarbjit Singh Bakhshi: After many successful years of operation, Maxeler is emerging into an exciting new space. Maxeler has its first cloud based product with Amazon and is able to offer onsite and/or elastic cloud operations for the first time in its history. With more Government work moving to the cloud, this is an exciting time as we can bring the high levels of Maxeler performance to Government data sets at a reasonable cost.
We have applications in cyber security, big and complex data analysis, and real-time networking that are essential for Governments to deal with the emerging threats from around the world.
If one considers the wider context of Government – including research and scientific computing, Maxeler also has a good story to tell. We are in STFC Daresbury and can provide exceptionally performant computing at low energy cost for supercomputer environments. Complementing this we have an active university programme with over 150 top universities around the world where scientists are using DataFlow Engines for a variety of computational tasks that are outperforming traditional CPU setups.
Maxeler is ready to offer services to the UK Government as an approved G-Cloud supplier and can use its experience with large data sets to help Government move from on-premises hosting to take advantage of the cloud and its ultra-fast computing, in line with the Government’s ‘Cloud First’ technology policy.
Q6. What are your aims and goals as the newly appointed Director of Government Affairs at Maxeler Technologies?
Sarbjit Singh Bakhshi: Improve our relations with Government in the UK and elsewhere. Maxeler Technologies has experience of working with high-pressure financial institutions. We’d like to offer Government the same opportunity to deal with its most pressing computational problems in an ultra-performant, energy-efficient and easy-to-manage way.
Sarbjit Bakhshi joined Maxeler Technologies from a long career working for the British Government. Working mainly in areas of European Union negotiations and technology policy and promotion, Sarbjit’s appointment marks a new phase for Maxeler Technologies and its commitment to work with Government in the UK and overseas as part of its next phase of expansion.