“Debugging AI systems is harder than debugging traditional ones, but not impossible. Mainly it requires a different mindset, that allows for nondeterminism and a partial understanding of what’s going on. Is the problem in the data, the system, or in how the system is being applied to the data? Debugging an AI is more like domesticating an animal than debugging a program.”– Pedro Domingos.
I have interviewed Pedro Domingos, professor of computer science at the University of Washington and the author of “The Master Algorithm”, a bestselling introduction to machine learning for non-specialists. We talked about various topics related to Artificial Intelligence, Machine Learning, and Deep Learning.
Q1. What’s the difference between Artificial Intelligence, Machine Learning, and Deep Learning?
Pedro Domingos: The goal of AI is to get computers to do things that in the past have required human intelligence: commonsense reasoning, problem-solving, planning, decision-making, vision, speech and language understanding, and so on. Machine learning is the subfield of AI that deals with a particularly important ability: learning. Just as in humans the ability to learn underpins all else, so machine learning is behind the growing successes of AI.
Deep learning is a specific type of machine learning loosely based on emulating the brain. Technically, it refers to learning neural networks with many hidden layers, but these days it’s used to refer to all neural networks.
Q2. Several AI scientists around the world would like to make computers learn as much about the world, as rapidly and flexibly, as humans do (or even more). How can results learned by machines be physically plausible or be made understandable to us?
Pedro Domingos: The results can be in the form of “if . . . then” rules, decision trees, or other representations that are easy for humans to understand. Some types of models can be visualized. Neural networks are opaque, but other types of model don’t have to be.
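To illustrate the “if . . . then” form mentioned above, here is a minimal sketch of a rule list whose decisions can be read off directly. The feature names (`income`, `late_payments`), the thresholds and the labels are invented for this example and do not come from the interview.

```python
# Hypothetical rule list: every rule is a (description, test, label) triple.
# Feature names and thresholds are illustrative assumptions only.
RULES = [
    ("late_payments > 2", lambda x: x["late_payments"] > 2, "reject"),
    ("income >= 50000", lambda x: x["income"] >= 50000, "approve"),
]
DEFAULT_LABEL = "review"


def classify(example):
    """Return (label, explanation) from the first rule that fires."""
    for description, test, label in RULES:
        if test(example):
            return label, f"if {description} then {label}"
    return DEFAULT_LABEL, f"no rule fired, so {DEFAULT_LABEL}"
```

Because each prediction comes with the rule that produced it, a person can check the reasoning, which is the property that opaque models lack.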
Q3. It seems no one really knows how the most advanced AI algorithms do what they do. Why?
Pedro Domingos: Since the algorithms learn from data, it’s not as easy to understand what they do as it would be if they were programmed by us, like traditional algorithms. But that’s the essence of machine learning: that it can go beyond our knowledge to discover new things. A phenomenon may be more complex than a human can understand, but not more complex than a computer can understand. And in many cases we also don’t know what humans do: for example, we know how to drive a car, but we don’t know how to program a car to drive itself. But with machine learning the car can learn to drive by watching video of humans driving.
Q4. That could be a problem. Do you agree?
Pedro Domingos: It’s a disadvantage, but how much of a problem it is depends on the application. If an AI algorithm that predicts the stock market consistently makes money, the fact that it can’t explain how it did it is something investors can live with. But in areas where decisions must be justified, some learning algorithms can’t be used, or at least their results have to be post-processed to give explanations (and there’s lots of research on this).
Q5. Let’s consider an autonomous car that relies entirely on an algorithm that had taught itself to drive by watching a human do it. What if one day the car crashed into a tree, or even worse, killed a pedestrian?
Pedro Domingos: If the learning took place before the car was delivered to the customer, the car’s manufacturer would be liable, just as with any other machinery. The more interesting problem is if the car learned from its driver. Did the driver set a bad example, or did the car not learn properly?
Q6. Would it be possible to create some sort of “AI-debugger” that let you see what the code does while making a decision?
Pedro Domingos: Yes, and many researchers are hard at work on this problem. Debugging AI systems is harder than debugging traditional ones, but not impossible. Mainly it requires a different mindset, that allows for nondeterminism and a partial understanding of what’s going on. Is the problem in the data, the system, or in how the system is being applied to the data? Debugging an AI is more like domesticating an animal than debugging a program.
Q7. How can computers learn together with us still in the loop?
Pedro Domingos: In so-called online learning, the system is continually learning and performing, like humans. And in mixed-initiative learning, the human may deliberately teach something to the computer, the computer may ask the human a question, and so on. These types of learning are not widespread in industry yet, but they exist in the lab, and they’re coming.
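As a rough illustration of the “learning and performing” loop in online learning, the sketch below uses a classic perceptron that makes a prediction for each incoming example and then updates itself immediately if it was wrong. The perceptron and the tiny data stream are stand-ins chosen for this example; the interview does not name a specific algorithm.

```python
class OnlinePerceptron:
    """Binary classifier that predicts first and then learns from each example."""

    def __init__(self, n_features, learning_rate=1.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        score = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if score > 0 else -1

    def learn_one(self, x, y):
        """Perform (predict), then update only if the prediction was wrong."""
        prediction = self.predict(x)
        if prediction != y:
            for i, xi in enumerate(x):
                self.weights[i] += self.learning_rate * y * xi
            self.bias += self.learning_rate * y
        return prediction


# Hypothetical stream: the label is +1 when the first feature exceeds the second.
stream = [([2.0, 1.0], 1), ([1.0, 3.0], -1), ([4.0, 0.5], 1), ([0.5, 2.0], -1)] * 5
model = OnlinePerceptron(n_features=2)
mistakes = sum(model.learn_one(x, y) != y for x, y in stream)
```

There is no separate training phase here: the model is used from the first example onward, and its mistakes become rarer as the stream goes on.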
Q8. Professional codes of ethics do little to change peoples’ behaviour. How is it possible to define incentives for using an ethical approach to software development, especially in the area of AI?
Pedro Domingos: I think ethical software development for AI is not fundamentally different from ethical software development in general. The interesting new question is: when AIs learn by themselves, how do we keep them from going astray? Fixed rules of ethics, like Asimov’s three laws of robotics, are too rigid and fail easily. (That’s what his robot stories were about.) But if we just let machines learn ethics by observing and emulating us, they will learn to do lots of unethical things. So maybe AI will force us to confront what we really mean by ethics before we can decide how we want AIs to be ethical.
Q9. Who will control in the future the Algorithms and Big Data that drive AI?
Pedro Domingos: It should be all of us. Right now it is mainly the companies that have lots of data and sophisticated machine learning systems, but all of us – as citizens and professionals and in our personal lives – should become aware of what AI is and what we can do with it. That’s why I wrote “The Master Algorithm”: so everyone can understand machine learning well enough to make the best use of it. How can I use AI to do my job better, to find the things I need, to build a better society? Just like driving a car does not require knowing how the engine works, but it does require knowing how to use the steering wheel and pedals, everyone needs to know how to control an AI system, and to have AIs that work for them and not for others, just like they have cars and TVs that work for them.
Q10. What are your current research projects?
Pedro Domingos: Today’s machine learning algorithms are still very limited compared to humans. In particular, they’re not able to generalize very far from the data.
A robot can learn to pick up a bottle in a hundred trials, but if it then needs to pick up a cup it has to start again from scratch. In contrast, a three-year-old can effortlessly pick anything up.
So I’m working on a new machine learning paradigm, called symmetry-based learning, where the machine learns individual transformations from data that preserve the essential properties of an object, and can then compose the transformations in many different ways to generalize very far from the data. For example, if I rotate a cup it’s still the same cup, and if I replace a word by a synonym in a sentence the meaning of the sentence is unchanged. By composing transformations like this I can arrive at a picture or a sentence that looks nothing like the original, but still means the same.
It’s called symmetry-based learning because the theoretical framework to do this comes from symmetry group theory, an area of mathematics that is also the foundation of modern physics.
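The idea of composing property-preserving transformations can be shown with a toy example. This is only a sketch of composition and invariance, not the symmetry-based learning method itself: the transformations here are hard-coded rather than learned, and the “essential property” is a deliberately crude stand-in (the multiset of pixel values).

```python
from functools import reduce


def rotate90(grid):
    """Rotate a square grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def mirror(grid):
    """Flip the grid left to right."""
    return [row[::-1] for row in grid]


def compose(*transforms):
    """Apply the given transformations from left to right."""
    return lambda grid: reduce(lambda g, t: t(g), transforms, grid)


def pixel_histogram(grid):
    """Stand-in for an 'essential property': the sorted multiset of pixel values."""
    return sorted(value for row in grid for value in row)


# Hypothetical 3x3 "picture of a cup".
cup = [[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0]]
variant = compose(rotate90, mirror, rotate90)(cup)
```

The composed result looks different from the original grid, yet the chosen property is unchanged, which is the sense in which composition lets one move far from the data while keeping what matters.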
Pedro Domingos is a professor of computer science at the University of Washington and the author of “The Master Algorithm”, a bestselling introduction to machine learning for non-specialists.
He is a winner of the SIGKDD Innovation Award, the highest honour in data science, and a Fellow of the Association for the Advancement of Artificial Intelligence. He has received a Fulbright Scholarship, a Sloan Fellowship, the National Science Foundation’s CAREER Award, and numerous best paper awards.
He received his Ph.D. from the University of California at Irvine and is the author or co-author of over 200 technical publications. He has held visiting positions at Stanford, Carnegie Mellon, and MIT. He co-founded the International Machine Learning Society in 2001. His research spans a wide variety of topics in machine learning, artificial intelligence, and data science, including scaling learning algorithms to big data, maximizing word of mouth in social networks, unifying logic and probability, and deep learning.
The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. New York: Basic Books, 2015.
What’s Missing in AI: The Interface Layer. In P. Cohen (ed.), Artificial Intelligence: The First Hundred Years. Menlo Park, CA: AAAI Press. To appear.
How Not to Regulate the Data Economy. Medium, 2018.
Ten Myths About Machine Learning. Medium, 2016.
Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world. John Roach, Microsoft AI Blog.
– Alchemy: Statistical relational AI.
– SPN: Sum-product networks for tractable deep learning.
– RDIS: Recursive decomposition for nonconvex optimization.
– BVD: Bias-variance decomposition for zero-one loss.
– NBE: Bayesian learner with very fast inference.
– RISE: Unified rule- and instance-based learner.
– VFML: Toolkit for mining massive data sources.
– online machine learning class. Pedro Domingos (Link to series of YouTube videos)
– On Technology Innovation, AI and IoT. Interview with Philippe Kahn ODBMS Industry Watch, January 27, 2018
– On Artificial Intelligence and Analytics. Interview with Narendra Mulani ODBMS Industry Watch, August 12, 2017
– How Algorithms can untangle Human Questions. Interview with Brian Christian. ODBMS Industry Watch, March 31, 2017
– Big Data and The Great A.I. Awakening. Interview with Steve Lohr. ODBMS Industry Watch, December 19, 2016
– Machines of Loving Grace. Interview with John Markoff. ODBMS Industry Watch, August 11, 2016
–On Artificial Intelligence and Society. Interview with Oren Etzioni. ODBMS Industry Watch, January 15, 2016
Follow us on Twitter: @odbmsorg
“An AI powered assistant can give you much better advice the more it knows about you and if it can collect data without burdening you. While this challenge creates the obvious but surmountable privacy issues, there is an interesting data integration challenge here to collect data from the digital breadcrumbs we leave all over, such as posts on social media, photos, data from wearables. Reconciling all these data sets into a meaningful and useful signal is a fascinating research problem!”–Alon Halevy
I have interviewed Alon Halevy, CEO of Megagon Labs. We talked about happiness, AI-powered journaling and the HappyDB database.
Q1. What is HappyDB?
Alon Halevy: HappyDB is a crowd-sourced text database of 100,000 answers to the following question: what made you happy in the last 24 hours (or 3 months)? Half of the respondents were asked about the last 24 hours and the other half about the last 3 months.
We collected HappyDB as part of our research agenda on technology for wellbeing. At a basic level, we’re asking whether it is possible to develop technology to make people happier. As part of that line of work, we are developing an AI-powered journaling application in which the user writes down the important experiences in their day. The goal is that the smart journal will understand over time what makes you happy and give you advice on what to do. However, to that end, we need to develop Natural Language Processing technology that can understand better the descriptions of these moments (e.g., what activity did the person do, with whom, and in what context). HappyDB was collected in order to create a corpus of text that will fuel such NLP research by our lab and by others.
Q2. The science of happiness is an area of positive psychology concerned with understanding what behaviors make people happy in a sustainable fashion. How is it possible to advance the state of the art of understanding the causes of happiness by simply looking at text messages?
Alon Halevy: One of the main observations of the science of happiness is that a significant part of people’s wellbeing is determined by the actions they choose to do on a daily basis (e.g., encourage social interactions, volunteer, meditate, etc). However, we are often not very good at making choices that maximize our sustained happiness because we’re focused on other activities that we think will make us happier (e.g., make more money, write another paper).
Because of that, we believe that a journaling application can give advice based on personal experiences that the user has had. The user of our application should be able to use text, voice or even photos to express their experiences.
The text in HappyDB is meant to facilitate the research required to understand texts given by users.
Q3. What are the main findings you have found so far?
Alon Halevy: The happy moments we see in HappyDB are not surprising in nature: they describe experiences that are known to make people happy, such as social events with family and friends, achievements at work, and enjoying nature and mindfulness. However, given that these experiences are expressed in so many different ways in text, the NLP challenge of understanding the important aspects of these moments is quite significant.
Q4. The happy moments are crowd-sourced via Amazon’s Mechanical Turk. Why?
Alon Halevy: That was the only way we could think of getting such a large corpus. I should note that we only get 2-3 replies from each worker, so this is not a longitudinal study about how people’s happiness changes over time.
The goal is just to collect text describing happy moments.
Q5. You mentioned that HappyDB is a collection of happy moments described by individuals experiencing those moments. How do you verify if these statements reflect the true state of mind of people?
Alon Halevy: You can’t verify such a corpus in any formal sense, but when you read the moments you see they are completely natural. We even have a moment from one person who was happy for getting tenure!
Q6. What is a reflection period?
Alon Halevy: A reflection period is how far back you look for the happy moment. For example, moments that cover a reflection period of 24 hours tend to mention a social event or meal, while moments based on a reflection of 3 months tend to mention a bigger event in life such as the birth of a child, promotion, or graduation.
Q7. The HappyDB corpus, like any other human-generated data, has errors and requires cleaning. How do you handle this?
Alon Halevy: We did a little bit of spell correcting and removed some moments that were obviously bogus (too long, too short). But the hope is that the sheer size of the database is its main virtue and the errors will be minor in the aggregate.
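A length-based filter of the kind described above might look like the following sketch. The word-count thresholds are assumptions made for illustration (the interview does not give the actual cut-offs), and the spell-correction step is omitted.

```python
def clean_moments(moments, min_words=2, max_words=100):
    """Normalize whitespace and drop moments that are too short or too long.

    The default thresholds are hypothetical, not the values used for HappyDB.
    """
    cleaned = []
    for text in moments:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        n_words = len(normalized.split())
        if min_words <= n_words <= max_words:
            cleaned.append(normalized)
    return cleaned
```

For example, `clean_moments(["ok", "I had  dinner with my family."])` keeps only the second entry, with its double space collapsed.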
Q8. What are the main NLP problems that can be studied with the help of this corpus?
Alon Halevy: There are quite a few NLP problems. The most basic is to figure out which activity made the person happy (and to distinguish the words describing the activity from all the extraneous text). Who are the people that were involved in the experience? Was there anything in the context that was critical (e.g., a sunset)? We can ask more reflective questions, such as: was the person happy from the experience because of a mismatch between their expectations and reality? Do men and women express happy experiences in different ways? Finally, can we create an ontology of activities that would cover the vast majority of happy moments and reliably map text to one or more of these categories?
Q9. What analysis techniques did you use to analyse HappyDB? Were you happy with the existing NLP techniques? Or is there a need for deeper NLP techniques?
Alon Halevy: We clearly need new NLP techniques to analyze this corpus and ones like it. In addition to standard somewhat shallow NLP techniques, we are focusing on trying to define frame structures that capture the essence of happy moments and to develop semantic role labeling techniques that map from text to these frame structures and their slots.
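To make “frame structures and their slots” concrete, here is a small sketch of what such a frame could look like in code. The slot names and the keyword lexicons are guesses made for this example, and the keyword lookup is only a placeholder for the semantic role labeling techniques the answer refers to.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Tiny illustrative lexicons; a real system would use a trained labeler.
ACTIVITY_WORDS = {"dinner", "hiking", "graduation"}
PEOPLE_WORDS = {"family", "friends", "daughter", "wife", "husband"}


@dataclass
class HappyMomentFrame:
    """Hypothetical frame for one happy moment."""
    text: str
    activity: Optional[str] = None
    participants: List[str] = field(default_factory=list)


def fill_frame(text):
    """Naive keyword lookup standing in for semantic role labeling."""
    frame = HappyMomentFrame(text=text)
    for token in text.lower().replace(".", " ").replace(",", " ").split():
        if token in ACTIVITY_WORDS and frame.activity is None:
            frame.activity = token
        elif token in PEOPLE_WORDS and token not in frame.participants:
            frame.participants.append(token)
    return frame
```

A lookup like this breaks as soon as the same experience is phrased differently, which is exactly why the answer calls for deeper techniques.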
Q10. Is HappyDB open to the public?
Qx. Anything else you wish to add?
Alon Halevy: Yes, I think developing technology for wellbeing raises some interesting challenges for data management in general. An AI powered assistant can give you much better advice the more it knows about you and if it can collect data without burdening you. While this challenge creates the obvious but surmountable privacy issues, there is an interesting data integration challenge here to collect data from the digital breadcrumbs we leave all over, such as posts on social media, photos, data from wearables. Reconciling all these data sets into a meaningful and useful signal is a fascinating research problem!
Dr. Alon Halevy is a computer scientist, entrepreneur and educator. He received his Ph.D. in Computer Science at Stanford University in 1993. He became a professor of Computer Science at the University of Washington and founded the Database Research Group at the university.
He founded Nimble Technology Inc., a company providing an Enterprise Information Integration Platform, and Transformic Inc., a company providing access to deep web content. Upon the acquisition of Transformic by Google Inc., he became responsible for research on structured data as a senior staff research scientist at Google’s head office and was engaged in research and development, such as developing Google Fusion Tables. He has served as CEO of Megagon Labs since 2016.
Dr. Halevy is a Fellow of the Association for Computing Machinery (ACM Fellow) and received the VLDB 10-year best paper award in 2006.
–Paper: HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments, Akari Asai, Sara Evensen, Behzad Golshan, Alon Halevy, Vivian Li, Andrei Lopatenko, Daniela Stepanov, Yoshihiko Suhara, Wang-Chiew Tan, Yinzhan Xu
–Software: BigGorilla is an open-source data integration and data preparation ecosystem (powered by Python) to enable data scientists to perform integration and analysis of data. BigGorilla consolidates and documents the different steps that are typically taken by data scientists to bring data from different sources into a single database to perform data analysis. For each of these steps, we document existing technologies and also point to desired technologies that could be developed.
The different components of BigGorilla are freely available for download and use. Data scientists are encouraged to contribute code, datasets, or examples to BigGorilla. We hope to promote education and training for aspiring data scientists with the development, documentation, and tools provided through BigGorilla.
–Software: Jo. Our work is inspired by psychology research, especially a field known as Positive Psychology. We are developing “Jo” – an agent that helps you record your daily activities, generalizes from them, and helps you create plans that increase your happiness. Naturally, this is no easy feat. Jo raises many exciting technical challenges for NLP, chatbot construction, and interface design: how can we build an interface that’s useful but not intrusive? Read more about Jo!
– Data Integration: From Enterprise Into Your Kitchen, Alon Halevy – SIGMOD/PODS Conference 2017
“I would argue that the definition of “small” keeps getting bigger as hardware improves and more economical storage options abound. As data volumes get bigger and bigger, organizations are looking to graduate out of the “small” arena and start to leverage big data for truly transformational projects.”–Ben Vandiver
I have interviewed Ben Vandiver, CTO at Vertica. Main topics of the interview are: Vertica database, the Cloud, and the new Vertica cloud architecture: Eon Mode.
Q1. Can you start by giving us some background on your role and history at Vertica?
Ben Vandiver: My bio covers a bit of this, but I’ve been at Vertica from version 2.0 to our newly released 9.1. Along the way I’ve seen Vertica transform from a database that could barely run SQL and delete records, to an enterprise grade analytics platform. I built a number of the core features of the database as a developer. Some of my side-projects turned into interesting features: Flex tables is Vertica’s schema-on-read mechanism and Key/Value allows fast, scalable single node queries. I started the Eon mode project 2 ½ years ago to enable Vertica to take advantage of variable workloads and shared storage, both on-premises and in the cloud. Upon promotion to CTO, I continue to remain engaged with development as a core architect, but I also look after product strategy, information flow within the Vertica organization, and technical customer engagement.
Q2. Is the assumption that “One size does not fit all” (aka Michael Stonebraker) still valid for the new generation of databases?
Ben Vandiver: Mike’s statement of “One size does not fit all” still holds and if anything, the proliferation of new tools demonstrates how relevant that statement still is today. Each tool is designed for a specific purpose and an effective data analytics stack combines a collection of best-in-class tools to address an organization’s data needs.
For “small” problems, a single flexible tool can often address these needs. But what exactly is “small” in today’s world?
I would argue that the definition of “small” keeps getting bigger as hardware improves and more economical storage options abound. As data volumes get bigger and bigger, organizations are looking to graduate out of the “small” arena and start to leverage big data for truly transformational projects. These organizations would benefit from developing a data stack that incorporates the right tools – BI, ETL, data warehousing, etc. – for the right jobs, and choosing solutions that favour a more open, ecosystem-friendly architecture.
This belief is evident in Vertica’s own product strategy, where our focus is to build the most performant analytical database on the market, free from underlying infrastructure and open to a wide range of ecosystem integrations.
Q3. Vertica, like many databases, started off on-premises and has moved to the cloud. What has that journey looked like?
Ben Vandiver: Our pure software, hardware agnostic approach has enabled Vertica to be deployed in a wide variety of configurations, from embedded devices to multiple cloud platforms. Historically, most of Vertica’s deployments have been on-premises, but we’ve been building AMIs for running Vertica in the Amazon cloud since 2008. More recently, we have built integrations for S3 read/write and cloud monitoring.
In our 9.0 release last year, we extended our SQL-on-Hadoop offering to support Amazon S3 data in ORC or Parquet format, enabling customers to run highly-performant analytical queries against their Hadoop data lakes on S3.
And of course, with our latest 9.1 release, the general availability of Eon Mode represents a transformational leap in our cloud journey.
With Eon Mode, Vertica is moving from simply integrating with cloud services to introducing a core architecture optimized specifically for the cloud, so customers can capitalize on the economics of compute and storage separation.
Q4. Vertica just released a completely new cloud architecture, Eon Mode. Can you describe what that is and how it works?
Ben Vandiver: Eon Mode is a new architecture that places the data on a reliable, cost-effective shared storage, while matching Vertica Enterprise Mode’s performance on existing workloads and supporting entirely new use cases. While the design reuses Vertica’s core optimizer and execution engine, the metadata, storage, and fault tolerance mechanisms are re-architected to enable and take advantage of shared storage. A sharding mechanism distributes load over the nodes while retaining the capability of running node-local table joins.
A caching layer provides full Vertica performance on in-cache data and transparent query on non-cached data with mildly degraded performance.
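The behaviour described here, fast reads for cached data and transparent but slower reads for everything else, resembles a read-through cache. The sketch below shows that general pattern only. The class name, the dictionary standing in for shared storage, and the least-recently-used eviction policy are all assumptions for illustration; the interview does not say how Vertica’s cache is implemented.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Node-local LRU cache in front of a slower shared store (a dict here)."""

    def __init__(self, shared_storage, capacity):
        self.shared_storage = shared_storage
        self.capacity = capacity
        self.local = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, file_name):
        if file_name in self.local:
            self.hits += 1
            self.local.move_to_end(file_name)  # mark as most recently used
            return self.local[file_name]
        self.misses += 1
        data = self.shared_storage[file_name]  # slower path: fetch from shared storage
        self.local[file_name] = data
        if len(self.local) > self.capacity:
            self.local.popitem(last=False)  # evict the least recently used file
        return data
```

The caller never needs to know whether a file was cached: a miss simply costs one extra fetch, which matches the “mildly degraded performance” on non-cached data.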
Eon Mode initially supports running on Amazon EC2 compute and S3 storage, but includes an internal API layer that we have built to support our roadmap vision for other shared storage platforms such as Microsoft Azure, Google Cloud, or HDFS.
Eon Mode demonstrates strong performance, superior scalability, and robust operational behavior.
With these improvements, Vertica delivers on the promise of cloud economics, by allowing customers to provision only the compute and storage resources needed – from month to month, day to day, or hour to hour – while supporting efficient elasticity. For organizations that have more dynamic workloads, this separation of compute and storage architecture represents a significant opportunity for cloud savings and operational efficiency.
Q5. What are the similarities and differences between Vertica Enterprise Mode and Vertica Eon Mode?
Ben Vandiver: Eon Mode and Enterprise Mode have both significant similarities and differences.
Both are accessible from the same RPM – the choice of mode is determined at the time of database deployment. Both use the same cost-based distributed optimizer and data flow execution engine. The same SQL functions that run on Enterprise Mode will also run on Eon Mode, along with Vertica’s extensions for geospatial, in-database machine learning, schema-on-read, user-defined functions, time series analytics, and so on.
The fundamental difference, however, is that Enterprise Mode deployments must provision storage capacity for the entire dataset, whereas for Eon Mode deployments the recommendation is to provision cache for the working set. Additionally, Eon Mode has a lightweight re-subscribe and cache-warming step, which speeds recovery for down nodes. Eon Mode can also scale out elastically and rapidly for performance improvements, which is the key to aligning resources with variable workloads and optimizing for cloud economics.
Many analytics platforms offered by cloud providers are not incentivized to optimize infrastructure costs.
Q6. How does Vertica distribute query processing across the cluster in Eon Mode and implement load balancing?
Ben Vandiver: Eon Mode combines a core Vertica concept, Projections, with a new sharding mechanism to distribute processing load across the cluster.
A Projection describes the physical storage for a table, stipulating columns, compression, sorting, and a set of columns to hash to determine how the data is laid out on the cluster. Eon introduces another layer of indirection, where nodes subscribe to and serve data for a collection of shards. During query processing, Vertica assembles a node to serve each shard, selecting from available subscribers. For an elastically scaled out cluster, each query will run on just some of the nodes of the cluster. The administrator can designate sub-clusters of nodes for workload isolation: clients connected to a sub-cluster run queries only on nodes in the sub-cluster.
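The shard and subscription idea can be sketched in a few lines. Everything named here is hypothetical: the CRC32 hash, the node names, and the least-loaded selection rule are assumptions chosen to make the example concrete, not a description of Vertica’s actual algorithm.

```python
import zlib


def shard_for(key, shard_count):
    """Map a segmentation key to a shard with a stable hash (CRC32 here)."""
    return zlib.crc32(str(key).encode("utf-8")) % shard_count


def plan_query(subscriptions, up_nodes, shard_count):
    """Pick one available subscriber per shard, preferring the least-loaded node."""
    load = {node: 0 for node in up_nodes}
    plan = {}
    for shard in range(shard_count):
        candidates = [n for n in subscriptions.get(shard, []) if n in load]
        if not candidates:
            raise RuntimeError(f"no available subscriber for shard {shard}")
        chosen = min(candidates, key=lambda n: (load[n], n))
        plan[shard] = chosen
        load[chosen] += 1
    return plan


# Hypothetical 3-shard database with 4 nodes; each shard has two subscribers.
subscriptions = {0: ["n1", "n2"], 1: ["n2", "n3"], 2: ["n3", "n4"]}
plan = plan_query(subscriptions, up_nodes={"n1", "n2", "n3", "n4"}, shard_count=3)
```

With four nodes and three shards, the plan uses only three nodes, which illustrates why a query on a scaled-out cluster runs on just some of the nodes. If a node is down, another subscriber of the same shard is chosen instead.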
Q7. What do you see as the primary benefits of separating compute and storage?
Ben Vandiver: Since storage capacity is decoupled from compute instances, an Eon Mode cluster can cost-effectively store a lot more data than an Enterprise Mode deployment. The resource costs associated with maintaining large amounts of historical data are minimized with Eon Mode, which removes the incentive to use two different tools (such as a data lake and a query engine) for current and historical queries.
The operational cost is also minimized since node failures are less impactful and easier to recover from.
On the flip side, running many compute instances against a small shared data set provides strong scale-out performance for interactive workloads. Elasticity allows movement between the two extremes to align resource consumption with dynamic needs. And finally, the operational simplicity of Eon Mode can be impactful to the health and sanity of the database administrators.
Q8. What types of engineering challenges had to be overcome to create and launch this new architecture?
Ben Vandiver: Eon Mode is an application of core database concepts to a cloud environment. Even though much of the core optimizer and execution engine functionality remains untouched, large portions of the operational core of the database are different in Eon Mode. While Vertica’s storage usage maps well to an object store like S3, determining when a file can be safely deleted was an interesting challenge. We also migrated a significant amount of our test infrastructure to AWS.
Finally, Vertica is a mature database, having been around for over 10 years – Eon Mode doesn’t have the luxury to launch as a 0.1 release full of bugs. This is why Eon Mode has been in Beta, both private and public, for the last year.
Q9. It’s still early days for Eon Mode’s general availability, but do you have any initial customer feedback or performance benchmarks?
Ben Vandiver: Although Eon Mode just became generally available, it’s been in Beta for the last year and a number of our Beta customers have had significant success with this new architecture. For instance, one large gaming customer of ours subjected a much smaller Eon Mode deployment to their full production load, and realized 30% faster load rates without any tuning. Some of their queries ran 3-6x faster, even when spilling out of the cache. Operationally, the company’s node recovery was 6-8x faster and new nodes could be added in under 30 minutes. Eon Mode not only improved this customer’s query performance; its dynamic AWS service consumption also resulted in dramatic cost savings.
Q10. What should we expect from Vertica in the future with respect to cloud and Eon Mode product development?
Ben Vandiver: We are working on expanding Eon Mode functionality in a variety of dimensions. By distributing work for a shard among a collection of nodes, Eon Mode can get more “crunch” from adding nodes, thus improving elasticity. Operationally, we are working on better support for sub-clusters, no-downtime upgrade, auto-scaling, and backup snapshots for operator error. As mentioned previously, deployment options like Azure cloud, Google cloud, HDFS, and other on-premises technologies are on our roadmap. Our initial 9.1 Eon Mode release is just the beginning. I’m excited at what the future holds for Vertica and the innovations we continue to bring to market in support of our customers.
I spent many years at MIT, picking up a bachelor’s, master’s, and PhD (my thesis was on Byzantine Fault Tolerance of Databases). I have a passion for teaching, having spent several years teaching computer science.
From classes of 25 to 400, I enjoy finding clear ways to explain technical concepts, untangle student confusion, and have fun in the process. The database group at MIT, located down the hall from my office, developed Vertica’s founding C-Store paper.
I joined Vertica as a software engineer in August 2008. Over the years, I worked on many areas of the product including transactions, locking, WOS, backup/restore, distributed query, execution engine, resource pools, networking, administrative tooling, metadata management, and so on. If I can’t answer a technical question myself, I can usually point at the engineer who can. Several years ago I made the transition to management, running the Distributed Infrastructure, Execution Engine, and Security teams. I believe in an inclusive engineering culture where everyone shares knowledge and works on fun and interesting problems together – I sponsor our Hackathons, Crack-a-thon, Tech Talks, and WAR Rooms.
More recently, I’ve been running the Eon project, which aims to support a cloud-ready design for Vertica running on shared storage. While engineering is where I spend most of my time, I occasionally fly out to meet customers, notably a number of bigger ones in the Bay area. I was promoted to Vertica CTO in May 2017.
– For more information on Vertica in Eon Mode, read the technical paper: Eon Mode: Bringing the Vertica Columnar Database to the Cloud.
– To learn more about Vertica’s cloud capabilities visit www.vertica.com/clouds
– On RDBMS, NoSQL and NewSQL databases. Interview with John Ryan ODBMS Industry Watch, 2018-03-09
– On Vertica and the new combined Micro Focus company. Interview with Colin Mahony ODBMS Industry Watch, 2017-10-25
Follow us on Twitter: @odbmsorg
“Time series data” are sequential series of data about things that change over time, usually indexed by a timestamp. The world around us is full of examples –Andrei Gorine.
I have interviewed Andrei Gorine, Chief Technical Officer and McObject co-founder.
Main topics of the interview are: time series analysis, “in-chip” analytics, efficient Big Data processing, and the STAC M3 Kanaga Benchmark.
Q1. Who is using time series analysis?
Andrei Gorine: “Time series data” are sequential series of data about things that change over time, usually indexed by a timestamp. The world around us is full of examples — here are just a few:
• Self-driving cars continuously read data points from the surrounding environment: distances, speed limits, etc. These readings are often collected in the form of time-series data, analyzed and correlated with other measurements or onboard data (such as the current speed) to make the car turn to avoid obstacles, slow down or speed up, etc.
• Retail industry point-of-sale systems collect data on every transaction and communicate that data to a back-end where it gets analyzed in real-time, allowing or denying credit, dispatching goods, and extending subsequent relevant retail offers. Every time a credit card is used, the information is put into a time series data store where it is correlated with other related data through sophisticated market algorithms.
• Financial market trading algorithms continuously collect real-time data on changing markets, run algorithms to assess strategies and maximize the investor’s return (or minimize loss, for that matter).
• Web services and other web applications instantly register hundreds of millions of events every second, and form responses through analyzing time-series data sets.
• Industrial automation devices collect data from millions of sensors placed throughout all sorts of industrial settings — plants, equipment, machinery, environment. Controllers run analysis to monitor the “health” of production processes, making instant control decisions, sometimes preventing disasters, but more often simply ensuring uneventful production.
Q2. Why is a columnar data layout important for time series analysis?
Andrei Gorine: Time-series databases have some unique properties dictated by the nature of the data they store. One of them is the simple fact that time-series data can accumulate quickly, e.g. trading applications can add millions of trade-and-quote (“TAQ”) elements per second, and sensors based on high-resolution timers generate piles and piles of data. In addition, time-series data elements are normally received by, and written into, the database in timestamp order. Elements with sequential timestamps are arranged linearly, next to each other on the storage media. Furthermore, a typical query for time-series data is an analytical query or aggregation based on the data’s timestamp (e.g. calculate the simple moving average, or volume weighted average price of a stock over some period). In other words, data requests often must gain access to a massive number of elements (in real life, often millions of elements) of the same time series.
The performance of a database query is directly related to the number of I/O calls required to fulfill the request: less I/O contributes to greater performance. Columnar data layout allows for significantly smaller working set data sizes, because the hypothetical per-column overhead is negligible compared to per-row overhead. For example, given a conservative 20 bytes of per-row overhead, storing 4-byte measurements in the horizontal layout (1 time series entry per row) requires 6 times as much space as the columnar layout (each row consumes 24 bytes, whereas one additional element in a columnar layout requires just 4 bytes). Since less storage space is required to store time series data in the columnar layout, fewer I/O calls are required to fetch any given amount of real (non-overhead) data. Another space-saving feature of the columnar layout is content compression: a columnar layout allows for far more efficient and algorithmically simpler compression algorithms (such as run-length encoding over a column). Lastly, each row in a row-based layout contains many columns. In other words, when the database run-time reads a row of data (or, more commonly, a page of rows), it is reading many columns. When analytics require only one column (e.g. to calculate an aggregate of a time-series over some window of time), it is far more efficient to read pages of just that column.
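The space arithmetic and the run-length encoding mentioned above can be checked with a few lines of Python. This is an editorial sketch only: the function names are hypothetical, the 20-byte and 4-byte figures are the ones quoted in the answer, and the per-column overhead is assumed to be zero for simplicity.

```python
ROW_OVERHEAD_BYTES = 20  # per-row overhead quoted in the text
VALUE_BYTES = 4          # size of one stored measurement


def row_layout_bytes(n_values: int) -> int:
    """One time-series entry per row: every row pays the overhead."""
    return n_values * (ROW_OVERHEAD_BYTES + VALUE_BYTES)


def columnar_layout_bytes(n_values: int, column_overhead: int = 0) -> int:
    """Values packed back to back; overhead is paid once per column (assumed 0)."""
    return column_overhead + n_values * VALUE_BYTES


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


n = 1_000_000
print(row_layout_bytes(n) / columnar_layout_bytes(n))  # 6.0
print(run_length_encode([5, 5, 5, 7, 7, 5]))           # [[5, 3], [7, 2], [5, 1]]
```

Run-length encoding pays off on a column because neighbouring values of one attribute tend to repeat. In a row layout the neighbours of a value are other attributes of the same row, so such runs rarely occur.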
The sequential pattern in which time-series data is stored and retrieved in columnar databases (often referred to as “spatial locality”) leads to a higher probability of preserving the content of various cache subsystems, including all levels of CPU cache, while running a query. Placing relevant data closer to processing units is vitally important for performance: L1 cache access is 3 times faster than L2 cache access, 10 times faster than L3 unshared line access and 100 times faster than access to RAM (i7 Xeon). In the same vein, the ability to utilize various vector operations, and the ever growing set of SIMD instruction sets (Single Instruction Multiple Data) in particular, contribute to speedy aggregate calculations. Examples include SIMD vector instructions that operate on multiple values contained in one large register at the same time, SSE (Streaming SIMD Extensions) instruction sets on Intel and AltiVec instructions on PowerPC, pipelining (i.e. computations inside the CPU that are done in stages), and much more.
Q3. What is “on-chip” analytics and how is it different than any other data analytics?
Andrei Gorine: This is also referred to as “in-chip” analytics. The concept of pipelining has been successfully employed in computing for decades. Pipelining refers to a series of data processing elements where the output of one element is the input of the next one. Instruction pipelines have been used in CPU designs since the introduction of RISC CPUs, and modern GPUs (graphical processors) pipeline various stages of common rendering operations. Elements of software pipeline optimizations are also found in operating system kernels.
Time-series data layouts are perfect candidates to utilize a pipelining approach. Operations (functions) over time-series data (in our product time series data are called “sequences”) are implemented through “iterators”. Iterators carry chunks of data relevant to the function’s execution (we call these chunks of data “tiles”). Sequence functions receive one or more input iterators, perform required calculations and write the result to an output iterator. The output iterator, in turn, is passed into the next function in the pipeline as an input iterator, building a pipeline that moves data from the database storage through the set of operations up to the result set in memory. The “nodes” in this pipeline are operations, while the edges (“channels”) are iterators. The interim operation results are not materialized in memory. Instead the “tiles” of elements are passed through the pipeline, where each tile is referenced by an iterator. The tile is the unit of data exchange between the operators in the pipeline. The tile size is small enough to keep the tile in the top-level L1 CPU cache and large enough to allow for efficient use of superscalar and vector capabilities of modern CPUs. For example, 128 time-series elements fit into a 32K cache. Hence the term “on-chip” or “in-chip” analytics. As mentioned, top-level cache access is 3 times faster than level two (L2) cache access.
To illustrate the approach, consider the operation x*y + z where x, y and z are large sequences, or vectors if you will (perhaps megabytes or even gigabytes). If the complete interim result of the first operation (x*y) is created, then at the moment the last element of it is received, the first element of the interim sequence is already pushed out of the cache. The second operation (+ z) would have to load it from memory. Tile-based pipelining avoids this scenario.
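The x*y + z example can be sketched with Python generators standing in for the iterators described above. This is an illustration of the idea only, not the product’s API: the names `tiles`, `multiply`, `add` and `fused_multiply_add` are hypothetical, and the tile size of 128 is borrowed from the example in the text. Python itself gives no control over CPU caches, so the sketch shows only the data flow.

```python
from itertools import islice

TILE_SIZE = 128  # elements per tile, borrowed from the example in the text


def tiles(sequence, size=TILE_SIZE):
    """Yield consecutive fixed-size chunks ("tiles") of a sequence."""
    it = iter(sequence)
    while True:
        tile = list(islice(it, size))
        if not tile:
            return
        yield tile


def multiply(xs, ys):
    """Pipeline node: consumes two tile streams, emits one tile of products."""
    for tx, ty in zip(xs, ys):
        yield [a * b for a, b in zip(tx, ty)]


def add(xs, ys):
    """Pipeline node: consumes two tile streams, emits one tile of sums."""
    for tx, ty in zip(xs, ys):
        yield [a + b for a, b in zip(tx, ty)]


def fused_multiply_add(x, y, z):
    """Compute x*y + z tile by tile; x*y never exists as a full sequence."""
    pipeline = add(multiply(tiles(x), tiles(y)), tiles(z))
    return [value for tile in pipeline for value in tile]


print(fused_multiply_add([1, 2, 3], [4, 5, 6], [10, 20, 30]))  # [14, 30, 48]
```

Because the generators are lazy, each product tile is consumed by `add` as soon as `multiply` emits it. Only one interim tile is alive at a time, which is the property the answer relies on to keep the working data in cache.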
Q4. What are the main technical challenges you face when executing distributed query processing and ensuring at the same time high scalability and low latency when working with Big Data?
Andrei Gorine: Efficient Big Data processing almost always requires data partitioning. Distributing data over multiple physical nodes, or even partitions on the same node, and executing software algorithms in parallel allows for better hardware resource utilization through maximizing CPU load and exploiting storage media I/O concurrency. Software lookup and analytics algorithms take advantage of each node’s reduced data set through minimizing memory allocations required to run the queries, etc. However, distributed data processing comes loaded with many challenges. From the standpoint of the database management system, the challenges are two-fold: distributed query optimization and data distribution.
First is optimizing distributed query execution plans so that each instance of the query running on a local node is tuned to minimize the I/O, CPU, buffer space and communications cost. Complex queries lead to complex execution plans. Complex plans require efficient distribution of queries through collecting, sharing and analyzing statistics in the distributed setup. Analyzing statistics is not a trivial task even locally, but in the distributed system environment the complexity of the task is an order of magnitude higher.
Another issue that requires a lot of attention in the distributed setting is runtime partition pruning. Partition pruning is an essential performance feature. In a nutshell, in order to avoid compiling queries every time, the queries are prepared (for example, “select ..where x=10” is replaced with “select ..where x=?”). The problem is that in the un-prepared form, the SQL compiler is capable of figuring out that the query is best executed on some known node. Yet in the second, prepared form, that “best” node is not known to the compiler. Thus, the choices are either sending the query to every node, or locating the node with the given key value during the execution stage.
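The difference between pruning at compile time and at run time can be sketched as follows. Everything here is a hypothetical illustration: the cluster size, the modulo placement rule and the function names are assumptions, not a description of any particular database.

```python
NUM_NODES = 4  # hypothetical cluster size


def node_for_key(key: int, num_nodes: int = NUM_NODES) -> int:
    """Assumed placement rule: a row with this key lives on node key % num_nodes."""
    return key % num_nodes


def plan_targets(predicate_value, num_nodes: int = NUM_NODES):
    """Compile-time plan for "select .. where x = <value>".

    A literal lets the compiler pick the single owning node. A prepared
    statement placeholder (modelled as None) leaves every node as a candidate.
    """
    if predicate_value is None:
        return list(range(num_nodes))
    return [node_for_key(predicate_value, num_nodes)]


def execute_targets(planned_nodes, bound_value: int, num_nodes: int = NUM_NODES):
    """Run-time pruning: once the parameter is bound, keep only the owning node."""
    owner = node_for_key(bound_value, num_nodes)
    return [node for node in planned_nodes if node == owner]


print(plan_targets(10))                         # [2]
print(plan_targets(None))                       # [0, 1, 2, 3]
print(execute_targets(plan_targets(None), 10))  # [2]
```

The prepared plan is compiled once and still reaches a single node, but only because the routing decision is repeated at execution time for every bound value.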
Even when the SQL execution plan is optimized for distributed processing, the efficiency of distributed algorithms heavily depends on the data distribution. Thus, the second challenge is often to figure out the data distribution algorithm so that a given set of queries are optimized.
Data distribution is especially important for JOIN queries — this is perhaps one of the greatest challenges for distributed SQL developers. In order to build a truly scalable distributed join, the best policy is to have records from all involved tables with the same key values located on the same node. In this scenario all joins are in fact local. But, in practice, this distribution is rare. A popular JOIN technique is to use “fact” and “dimension” tables on all nodes while sharding large tables. However, building dimension tables requires special attention from application developers. The ultimate solution to the distributed JOIN problem is to implement the “shuffle join” algorithm. Efficient shuffle join is, however, very difficult to put together.
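The general shape of a shuffle (repartition) join can be sketched in a single process: both inputs are redistributed by a hash of the join key so that matching rows meet on the same node, and each node then performs an ordinary local join. This is a textbook-style sketch under simplifying assumptions (rows are tuples, nodes are plain lists, no network or memory limits), with hypothetical function names. The hard parts the answer alludes to, such as moving data efficiently between real nodes, are not modelled.

```python
from collections import defaultdict


def shuffle(rows, key_index, num_nodes):
    """Send each row to the node chosen by a hash of its join key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key_index]) % num_nodes].append(row)
    return nodes


def local_hash_join(left, right, left_key, right_key):
    """Ordinary hash join on rows that already sit on the same node."""
    index = defaultdict(list)
    for l_row in left:
        index[l_row[left_key]].append(l_row)
    return [l_row + r_row
            for r_row in right
            for l_row in index.get(r_row[right_key], [])]


def shuffle_join(left, right, left_key=0, right_key=0, num_nodes=3):
    """Repartition both inputs on the join key, then join node by node."""
    left_parts = shuffle(left, left_key, num_nodes)
    right_parts = shuffle(right, right_key, num_nodes)
    result = []
    for l_part, r_part in zip(left_parts, right_parts):
        result.extend(local_hash_join(l_part, r_part, left_key, right_key))
    return result


instruments = [(1, "IBM"), (2, "AAPL"), (3, "MSFT")]
quotes = [(1, 100.0), (3, 250.0), (3, 251.0), (4, 9.0)]
print(sorted(shuffle_join(instruments, quotes)))
```

Both inputs use the same hash function and node count, so rows with equal keys always land on the same node and no match is missed.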
Q5. What is the STAC M3 Kanaga Benchmark and what is it useful for?
Andrei Gorine: STAC M3 Kanaga simulates financial applications’ patterns over large sets of data. The data is represented via simplified historical randomized datasets reflecting ten years of trade and quote (TAQ) data. The entire dataset is about 30 terabytes in size. The Kanaga test suite consists of a number of benchmarks aimed to compare different aspects of its “System Under Test” (SUT), but mostly to highlight performance benefits of the hardware and DBMS software utilized. The Kanaga benchmark specification was written by practitioners from global banks and trading firms to mimic real-life patterns of tick analysis. Our implementations of the STAC-M3 benchmark aim to fully utilize the underlying physical storage I/O channels, and maximize CPU load by dividing the benchmark’s large dataset into a number of smaller parts called “shards”. Based on the available hardware resources, i.e. the number of CPU cores and physical servers, I/O channels, and sometimes network bandwidth, the number of shards can vary from dozens to hundreds. Each shard’s data is then processed by the database system in parallel, usually using dedicated CPU cores and media channels, and the results of that processing (calculated averages, etc.) are combined into a single result set by our distributed database management system.
The Kanaga test suite includes a number of benchmarks symptomatic of financial markets application patterns:
• An I/O-bound HIBID benchmark that calculates the high bid offer value over a period of time (one year for Kanaga). The database management system optimizes processing through parallelizing time-series processing and extensive use of single instruction, multiple data (SIMD) instructions, yet the total IOPS (Inputs/Outputs per Second) that the physical storage is capable of is an important factor in receiving better results.
• The “market snapshot” benchmark stresses the SUT (the database and the underlying hardware storage media), requiring them to perform well under a high-load parallel workload that simulates real-world financial applications’ multi-user data access patterns. In this test, (A) the ability to execute columnar-storage operations in parallel, (B) efficient indexing and (C) low storage I/O latency play important roles in getting better results.
• The volume-weighted average bid (VWAB) benchmarks, calculated over a one-day period. On the software side, the VWAB benchmarks benefit from the use of the columnar storage and analytics function pipelining discussed above to maximize efficient CPU cache utilization and CPU bandwidth and reduce main memory requirements. Hardware-wise, I/O bandwidth and latency play a notable role.
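The shard-and-combine pattern described above can be sketched for the two aggregate-style benchmarks. This is an illustration only: it assumes the usual volume-weighted form, sum(bid × size) / sum(size), which may differ in detail from the benchmark specification, and the function names and the `(bid, size)` tuple layout are hypothetical. Threads are used merely to show the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_partial(quotes):
    """Per-shard work: (sum of bid * size, sum of size, highest bid)."""
    notional = sum(bid * size for bid, size in quotes)
    volume = sum(size for _, size in quotes)
    high = max((bid for bid, _ in quotes), default=None)
    return notional, volume, high


def combine(partials):
    """Merge the per-shard partial results into one (vwab, high_bid) pair."""
    notional = sum(p[0] for p in partials)
    volume = sum(p[1] for p in partials)
    highs = [p[2] for p in partials if p[2] is not None]
    vwab = notional / volume if volume else None
    return vwab, (max(highs) if highs else None)


def run(shards):
    """Process every shard in parallel, then combine the partial results."""
    with ThreadPoolExecutor() as pool:
        return combine(list(pool.map(shard_partial, shards)))


shards = [[(10.0, 100), (11.0, 300)], [(12.0, 100)], []]
print(run(shards))  # (11.0, 12.0)
```

Each shard returns sums rather than its own average, because averaging the per-shard averages would give the wrong answer whenever shards carry different volumes.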
Andrei Gorine, Chief Technical Officer, McObject.
McObject co-founder Andrei leads the company’s product engineering. As CTO, he has driven growth of the eXtremeDB real-time embedded database system, from the product’s conception to its current wide usage in virtually all embedded systems market segments. Mr. Gorine’s strong background includes senior positions with leading embedded systems and database software companies; his experience in providing embedded storage solutions in such fields as industrial control, industrial preventative maintenance, satellite and cable television, and telecommunications equipment is highly recognized in the industry. Mr. Gorine has published articles and spoken at many conferences on topics including real-time database systems, high availability, and memory management. Over the course of his career he has participated in both academic and industry research projects in the area of real-time database systems. Mr. Gorine holds a Master’s degree in Computer Science from the Moscow Institute of Electronic Machinery and is a member of IEEE and ACM.
“With IBM having contributed huge amounts of code and other resources to Spark we are likely to see an explosion in the number of new machine learning components.”–Leon Guzenda
I have interviewed Leon Guzenda, co-founder of Objectivity, Inc. In the interview we covered the Industrial Internet of Things, sensor fusion systems, and ThingSpan.
Q1. What is the Industrial Internet of Things (IIoT)? How is it different from the Internet of Things (IoT)?
Leon Guzenda: The IIoT generally refers to the application of IoT technologies to manufacturing or process control problems. As such it is a subset of IoT with specialized extensions for the problems that it has to tackle.
Q2. What is a sensor fusion system?
Leon Guzenda: A sensor fusion system takes data streamed from multiple sensors and combines it, and possibly other data, to form a composite view of a situation or system. An example would be combining data from different kinds of reconnaissance sources, such as images, signals intelligence and infrared sensors, taken from different viewpoints to produce a 3D visualization for tracking or targeting purposes.
Leon Guzenda: Some sensor fusion systems combine one or a few types of data from multiple sources, such as the detectors in a linear accelerator, or measurements from medical instruments, though there may be many variants of a single kind of data. However, most have to handle a wide variety of data types, ranging from video to documents and streams of financial or other information. The data may be highly interconnected by many types of relationship, forming tree or graph structures. The DBMS must make it easy to track the provenance and quality of data as algorithms are applied to the raw data to make it suitable for downstream processes and queries. The DBMS must have low latency, i.e. the time from receiving data to it being available to multiple users. It has to be able to cope with fast moving streams of data in addition to small transactions and batched inputs. Above all, it must be able to scale and work in distributed environments.
Q4. What are the main technical challenges in capturing and analysing information from many different sources in near real-time for new insights in order to make critical decisions?
Leon Guzenda: The DBMS must have the ability to support compute intensive algorithms, which generally precludes the use of tabular schemas. There is a trend toward using modular, open source components, such as Spark Machine Learning Library (MLlib), so support for Spark Dataframes is important in some applications. It must have a flexible schema so that it can adapt rapidly to deal with new or changed data sources. Maintaining consistent, low latency is challenging when fast moving streams of incoming data have to be merged and correlated with huge volumes of existing data.
Q5. Why does ‘after the fact analysis’ not work with real-time and data sensor fusion systems?
Leon Guzenda: Processes managed with the help of sensors and fusion systems may fail or get out of control if action isn’t taken immediately when changes occur. In other cases, opportunities may be lost if resources can’t be brought to bear on a problem, be it a cybersecurity or physical threat.
Q6. What is the impact of open source technologies, such as Spark, Kafka, HDFS, YARN, for the Industrial Internet of Things?
Leon Guzenda: Apache Spark provides a scalable, standard and flexible platform for bringing multiple components together to build standard or ad hoc workflows, e.g. with YARN. Kafka and Samza make it easier to split streams of data into pipelines for parallel ingest and query handling. HDFS is good for reliably storing files, but it is far from ideal for handling randomly accessed data as it moves data in 64 MB blocks, increasing latency. Nevertheless, ThingSpan can run on HDFS with data cached by Spark, but we prefer to run it on industry standard POSIX filesystems for most purposes.
With IBM having contributed huge amounts of code and other resources to Spark we are likely to see an explosion in the number of new machine learning components. By combining this with ThingSpan’s graph analytics capabilities, we’ll be able to attack new kinds of problem.
Q7. Why does ThingSpan offer DO as a query language and not an extension of SQL?
Leon Guzenda: We would like to contribute the graph processing ideas in DO to the SQL community and are seeking partners to try to make that happen. However, our customers need a solution now, so we considered open source options, such as Cypher and SPARQL. In the end, we decided that it would be faster and more controllable to leverage the flexible schema and query handling components within the ThingSpan kernel to give our products a competitive edge, particularly at scale.
Q8. What are the similarities and differences between ThingSpan and Neo4j? They both handle complex graphs.
Leon Guzenda: Both handle Vertex and Edge objects. Neo4j depends on properties whereas ThingSpan can also operate with connections that have no data within them. The ThingSpan declarative query language, DO, incorporates most of the graph querying capabilities of Cypher and extends them with advanced parallel pathfinding capabilities.
However, the main differentiator is performance as a graph scales. Although Neo4j has been introducing some distributed operations and has a port for Spark, it is inherently not a distributed DBMS with a single logical view of all of the data within a repository. Although it is capable of handling graphs with millions of nodes it hasn’t shown the ability to handle very large graphs. Objectivity has customers processing tens of trillions of nodes and connections per day for thousands of analysts.
Q9. ThingSpan and Objectivity/DB: how do they relate with each other (if any)?
Leon Guzenda: ThingSpan uses Objectivity/DB as its data repository. Besides the Java, C++, C# and Python APIs, it also has a REST API and adaptors for Spark Dataframes and HDFS. Objectivity/DB is a component of the ThingSpan suite and can be purchased on its own for embedded applications or to run in non-Spark environments.
Q10. What kinds of things are on the roadmap for ThingSpan?
Leon Guzenda: We recently announced the availability of ThingSpan on the AWS Marketplace, making it easier to evaluate and deploy ThingSpan in a resilient, elastic cloud environment.
The next release, which is in QA at the moment, will add high speed pipelining for ingesting streamed data. It also has extensions to DO, particularly in regard to pathfinding and schema manipulation. There is also a new graph visualization tool for developers.
He worked with Objectivity’s major partners and customers to help them deploy the industry’s highest-performing, most reliable DBMS technology. Leon has over 40 years of experience in the software industry. At Automation Technology Products, he managed the development of the ODBMS for the Cimplex solid modeling and numerical control system. Before that he was Principal Project Director for the Dataskil division of International Computers Ltd. in the United Kingdom, delivering major projects for NATO and leading multinationals. He was also design and development manager for ICL’s 2900 IDMS product at ICL Bracknell. He spent the first 7 years of his career working in defense and government systems.
“We believe that businesses today are looking for ways to leverage the large amounts of data collected, which is driving them to try to minimize, or eliminate, the delay between event, insight, and action to embed data-driven intelligence into their real-time business processes.” –Simon Player
I have interviewed Simon Player, Director of Development for TrakCare and Data Platforms, Helene Lengler, Regional Director for DACH & BeNeLux, and Joe Lichtenberg, Director of Marketing for Data Platforms. All three work at InterSystems. We talked about the new InterSystems IRIS Data Platform.
Q1. You recently announced the InterSystems IRIS Data Platform®. What is it?
Simon Player: We believe that businesses today are looking for ways to leverage the large amounts of data collected, which is driving them to try to minimize, or eliminate, the delay between event, insight, and action to embed data-driven intelligence into their real-time business processes.
It is time for database software to evolve and offer multiple capabilities to manage that business data within a single, integrated software solution. This is why we chose to include the term ‘data platform’ in the product’s name.
InterSystems IRIS Data Platform supports transactional and analytic workloads concurrently, in the same engine, without requiring moving, mapping, or translating the data, eliminating latency and complexity. It incorporates multiple, disparate and dissimilar data sources, supports embedded real-time analytics, easily scales for growing data and user volumes, interoperates seamlessly with other systems, and provides flexible, agile, DevOps-compatible deployment capabilities.
InterSystems IRIS provides concurrent transactional and analytic processing capabilities; support for multiple, fully synchronized data models (relational, hierarchical, object, and document); a complete interoperability platform for integrating disparate data silos and applications; and sophisticated structured and unstructured analytics capabilities supporting both batch and real-time use cases in a single product built from the ground up with a single architecture. The platform also provides an open analytics environment for incorporating best-of-breed analytics into InterSystems IRIS solutions, and offers flexible deployment capabilities to support any combination of cloud and on-premises deployments.
Q2. How is InterSystems IRIS Data Platform positioned with respect to other Big Data platforms in the market (e.g. Amazon Web Services, Cloudera, Hortonworks Data Platform, Google Cloud Platform, IBM Watson Data Platform and Watson Analytics, Oracle Data Cloud system, Microsoft Azure, to name a few)?
Joe Lichtenberg: Unlike other approaches that require organizations to implement and integrate different technologies, InterSystems IRIS delivers all of the functionality in a single product with a common architecture and development experience, making it faster and easier to build real-time, data rich applications. However it is an open environment and can integrate with existing technologies already in use in the customer’s environment.
Q3. How do you ensure High Performance with Horizontal and Vertical Scalability?
Simon Player: Scaling a system vertically by increasing its capacity and resources is a common, well-understood practice. Recognizing this, InterSystems IRIS includes a number of built-in capabilities that help developers leverage the gains and optimize performance. The main areas of focus are Memory, IOPS and Processing management. Some of these tuning mechanisms operate transparently, while others require specific adjustments on the developer’s own part to take full advantage.
One example of those capabilities is parallel query execution. Built on a flexible infrastructure for maximizing CPU usage, it spawns one process per CPU core and is most effective with large data volumes, such as analytical workloads that perform large aggregations.
When vertical scaling does not provide the complete solution—for example, when you hit the inevitable hardware (or budget) ceiling—data platforms can also be scaled horizontally. Horizontal scaling fits very well with virtual and cloud infrastructure, in which additional nodes can be quickly and easily provisioned as the workload grows, and decommissioned if the load decreases.
InterSystems IRIS accomplishes this by providing the ability to scale for both increasing user volume and increasing data volume.
For increased user capacity, we leverage a distributed cache with an architectural solution that partitions users transparently across a tier of application servers sitting in front of our data server(s). Each application server handles user queries and transactions using its own cache, while all data is stored on the data server(s), which automatically keeps the application server caches in sync.
For increased data volume, we distribute the workload to a sharded cluster with partitioned data storage, along with the corresponding caches, providing horizontal scaling for queries and data ingestion. In a basic sharded cluster, a sharded table is partitioned horizontally into roughly equal sets of rows called shards, which are distributed across a number of shard data servers. For example, if a table with 100 million rows is partitioned across four shard data servers, each stores a shard containing about 25 million rows. Queries against a sharded table are decomposed into multiple shard-local queries to be run in parallel on multiple servers; the results are then transparently combined and returned to the user. This distributed data layout can further be exploited for parallel data loading and with third party frameworks like Apache Spark.
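The decomposition of a query into shard-local queries can be sketched as follows. This is a generic illustration, not InterSystems IRIS code: the hash-based row placement, the `(id, category)` row layout and the function names are all assumptions made for the example, and the shards are processed in a simple loop rather than in parallel on separate servers.

```python
from collections import Counter


def partition(rows, num_shards, key=lambda row: row[0]):
    """Split a table horizontally; each row goes to the shard chosen by its key."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[hash(key(row)) % num_shards].append(row)
    return shards


def shard_local_count(shard):
    """Shard-local query: count rows per category on this shard only."""
    return Counter(category for _, category in shard)


def sharded_count(rows, num_shards=4):
    """Run the shard-local query on every shard and merge the partial counts."""
    total = Counter()
    for shard in partition(rows, num_shards):
        total += shard_local_count(shard)
    return total


rows = [(i, "even" if i % 2 == 0 else "odd") for i in range(100)]
print([len(shard) for shard in partition(rows, 4)])  # [25, 25, 25, 25]
print(sharded_count(rows) == Counter(even=50, odd=50))  # True
```

The caller sees a single merged result and never needs to know how many shards contributed to it, which is the sense in which the combination step is transparent.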
Horizontal clusters require greater attention to the networking component to ensure that it provides sufficient bandwidth for the multiple systems involved and is entirely transparent to the user and the application.
Q4. How can you simultaneously process both transactional and analytic workloads in a single database?
Simon Player: At the core of InterSystems IRIS is a proven, enterprise-grade, distributed, hybrid transactional-analytic processing (HTAP) database. It can ingest and store transactional data at very high rates while simultaneously processing high volumes of analytic workloads on real-time data (including ACID-compliant transactional data) and non-real-time data. This architecture eliminates the delays associated with moving real-time data to a different environment for analytic processing. InterSystems IRIS is built on a distributed architecture to support large data volumes, enabling organizations to analyze very large data sets while simultaneously processing large amounts of real-time transactional data.
Q5. There are a wide range of analytics, including business intelligence, predictive analytics, distributed big data processing, real-time analytics, and machine learning. How do you support them in the InterSystems IRIS Data Platform?
Simon Player: Many of these capabilities are built into the platform itself and leverage that tight integration to simultaneously process both transactional and analytic workloads; however, we realize that there are multiple use cases where customers and partners would like InterSystems IRIS Data Platform to access data on other systems or to build solutions that leverage best-of-breed tools (such as ML algorithms, Spark etc.) to complement our platform and quickly access data stored on it.
That’s why we chose to provide open analytics capabilities supporting industry standard APIs such as UIMA, Java Integration, xDBC and other connectivity options.
Q6. What about third-party analytics tools?
Simon Player: The InterSystems IRIS Data Platform offers embedded analytics capabilities such as business intelligence, distributed big data processing & natural language processing, which can handle both structured and unstructured data with ease. It is designed as an Open Analytics Platform, built around a universal, high-performance and highly scalable data store.
Third-party analytics tools can access data stored on the platform via standard APIs including ODBC, JDBC, .NET, SOAP, REST, and the new Apache Spark Connector. In addition, the platform supports working with industry-standard analytical artifacts such as predictive models expressed in PMML and unstructured data processing components adhering to the UIMA standard.
Q7. How does InterSystems IRIS Data Platform integrate into existing infrastructures and with existing best-of-breed technologies (including your own products)?
Simon Player: InterSystems IRIS offers a powerful, flexible integration technology that enables you to eliminate “siloed” data by connecting people, processes, and applications. It includes the comprehensive range of technologies needed for any connectivity task.
InterSystems IRIS can connect to your existing data and applications, enabling you to leverage your investment, rather than “ripping and replacing.” With its flexible connectivity capabilities, solutions based on InterSystems IRIS can easily be deployed in any client environment.
A comprehensive library of adapters provides out-of-the-box connectivity and data transformations for packaged applications, databases, industry standards, protocols, and technologies – including SQL, SOAP, REST, HTTP, FTP, SAP, TCP, LDAP, Pipe, Telnet, and Email.
Object inheritance minimizes the effort required to build any needed custom adapters. Using InterSystems IRIS’ unit testing service, custom adapters can be tested without first having to complete the entire solution. Traceability of each event allows efficient analysis and debugging.
The InterSystems IRIS messaging engine offers guaranteed message delivery, content-based routing, high-performance message transformation, and support for both synchronous and asynchronous interactions. InterSystems IRIS has a graphical editor for business process orchestration, a business rules engine, and a workflow editor that enable you to automate your enterprise-wide business procedures or create new composite applications. With world-class support for XML, SOAP, JSON and REST, InterSystems
Because it includes a high performance transactional-analytic database, InterSystems IRIS can store and analyze messages as they flow through your system. It enables business activity monitoring, alerting, real-time business intelligence, and event processing.
· Other integration points with industry standards or best-of-breed technologies include the ability to easily and securely transport files between client machines and the server via our Managed File Transfer (MFT) capability. This functionality leverages state-of-the-art MFT providers like Box, Dropbox and KiteWorks to provide a simple client that non-technical users can install and companies can pre-configure and brand. InterSystems IRIS connects with these providers as a peer and exposes common APIs (e.g. to manage users).
· When using Apache Spark for large distributed data processing and analytics tasks, the Spark Connector will leverage the distributed data layout of sharded tables and push computation as close to the data as possible, increasing parallelism and thus overall throughput significantly vs regular JDBC connections.
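The benefit of pushing computation to the shards can be illustrated with a small simulation. The sketch below is plain Python, not the Spark Connector or any InterSystems API; the shard contents and the function names `pull_then_aggregate` and `pushdown_aggregate` are hypothetical. It compares shipping every row to the client with letting each shard aggregate its own rows in parallel first.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each holds a slice of one logical table of (region, amount) rows.
SHARDS = [
    [("EU", 10.0), ("EU", 1.0), ("US", 5.0), ("US", 2.0)],
    [("EU", 2.5), ("APAC", 7.0), ("APAC", 3.0), ("EU", 0.5)],
    [("US", 1.5), ("EU", 4.0), ("US", 0.5), ("EU", 2.0)],
]


def pull_then_aggregate(shards):
    """Single-connection style: move every row to the client, then aggregate there."""
    rows = [row for shard in shards for row in shard]
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals, len(rows)  # second value: number of rows moved


def partial_sum(shard):
    """Work done next to the data: one partial total per region per shard."""
    partial = {}
    for region, amount in shard:
        partial[region] = partial.get(region, 0.0) + amount
    return partial


def pushdown_aggregate(shards):
    """Pushdown style: shards aggregate in parallel, the client merges small partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum, shards))
    totals = {}
    for partial in partials:
        for region, amount in partial.items():
            totals[region] = totals.get(region, 0.0) + amount
    moved = sum(len(partial) for partial in partials)
    return totals, moved


if __name__ == "__main__":
    print(pull_then_aggregate(SHARDS))
    print(pushdown_aggregate(SHARDS))
```

Both functions return the same totals, but the pushdown variant moves only one partial result per region per shard instead of every row, and the per-shard work runs concurrently.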
Q8. What market segments do you address with IRIS Data Platform?
Helene Lengler: InterSystems IRIS is an open platform that suits virtually any industry, but we will be initially focusing on a couple of core market segments, primarily due to varying regional demand. For instance, we will concentrate on the financial services industry in the US or UK and the retail and logistics market in the DACH and Benelux regions. Additionally, in Germany and Japan, our major focus will be on the manufacturing industry, where we see a rapidly growing demand for data-driven solutions, especially in the areas of predictive maintenance and predictive analytics.
We are convinced that InterSystems IRIS is ideal for this and also for other kinds of IoT applications with its ability to handle large-scale transactional and analytic workloads. On top of this, we are also looking to engage with companies that are at the very beginning of product development – in other words, start-ups and innovators working on solutions that require a robust, future-proof data platform.
Q9. Are there any proofs of concept available?
Helene Lengler: Yes. Although the solution has only been available to selected partners for a couple of weeks, we have already completed the first successful migration in Germany. A partner that is offering an Enterprise Information Management System, which allows organizations to archive and access all of an organization’s data, documents, emails and paper files has been able to migrate from InterSystems Caché to InterSystems IRIS in as little as a couple of hours and – most importantly – without any issues at all. The partner decided to move to InterSystems IRIS because they are in the process of signing a contract with one of the biggest players in the German travel & transport industry. With customers like this, you are looking at data volumes in the Petabyte range very, very shortly, meaning you require the right technology from the start in order to be able to scale horizontally – using the InterSystems IRIS technologies such as sharding – as well as vertically.
In addition, we were able to show a live IoT demonstrator at our InterSystems DACH Symposium in November 2017. This proof of concept is actually a lighthouse example of what the new platform brings to the table: A team of three different business partners and InterSystems experts leveraged InterSystems IRIS’ capabilities to rapidly develop and implement a fully functional solution for a predictive maintenance scenario. Numerous other test scenarios and PoCs are currently being conducted in various industry segments with different partners around the globe.
Q10. Can developers already use InterSystems IRIS Data Platform?
Simon Player: Yes. Starting on 1/31, developers can use our sandbox, the InterSystems IRIS Experience, at www.intersystems.com/experience.
Qx. Anything else you wish to add?
Simon Player: The public is welcome to join the discussion on how to graduate from database to data platform on our developer community at https://community.intersystems.com.
Simon Player is director of development for both TrakCare and Data Platforms at InterSystems. Simon has used and developed on InterSystems technologies since the early 1990s. He holds a BSc in Computer Sciences from the University of Manchester.
Helene Lengler is the Regional Managing Director for the DACH and Benelux regions. She joined InterSystems in July 2016 and has more than 25 years of experience in the software technology industry. During her professional career, she has held various senior positions at Oracle, including Vice President (VP) Sales Fusion Middleware and member of the executive board at Oracle Germany, VP Enterprise Sales and VP of Oracle Direct. Prior to her 16 years at Oracle, she worked for the Digital Equipment Corporation in several business disciplines such as sales, marketing and presales.
Helene holds a Masters degree from the Julius-Maximilians-University in Würzburg and a post-graduate Business Administration degree from AKAD in Pinneberg.
Joe Lichtenberg is responsible for product and industry marketing for data platform software at InterSystems. Joe has decades of experience working with various data management, analytics, and cloud computing technology providers.
Follow us on Twitter: @odbmsorg
“There is a lot of hype about the dangers of IoT and AI. It’s important to understand that nobody is building Blade-Runner style replicants.” — Philippe Kahn
I have interviewed Philippe Kahn. Philippe is a mathematician, well known technology innovator, entrepreneur and founder of four technology companies: Fullpower Technologies, LightSurf Technologies, Starfish Software and Borland.
Q1. Twenty years ago, you spent about a year working on a Web-based infrastructure that you called Picture Mail. Picture Mail would do what we now call photo “sharing”. How come it took so long before the introduction of the iPhone, Snapchat, Instagram, Facebook Live and co.?
Philippe Kahn: Technology adoption takes time. We designed a system where a picture would be stored once and a link-back would be sent as a notification to thousands. That’s how Facebook and others function today. At the time, necessity drove the design: for wireless devices and the first camera phones, the bandwidth on cellular networks was 1200 baud at most and very costly. Today a picture or a video is shared once on Facebook and millions or even billions can be notified. It’s exactly the same approach.
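The "store once, notify many" idea can be sketched in a few lines. This is a minimal illustration in Python, not the original Picture Mail design; the class `PictureStore`, the function `notify`, the placeholder base URL, and the choice of a content hash as the key are all assumptions made for the example.

```python
import hashlib


class PictureStore:
    """Hypothetical in-memory store: each picture is kept once and addressed by a link."""

    def __init__(self, base_url="https://example.invalid/p/"):
        self.base_url = base_url  # placeholder URL, not a real service
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # Assumption: a content hash serves as the key, so identical uploads share one copy.
        key = hashlib.sha256(data).hexdigest()[:16]
        self._blobs.setdefault(key, data)
        return self.base_url + key

    def get(self, link: str) -> bytes:
        # Resolve a link back to the stored bytes.
        return self._blobs[link.rsplit("/", 1)[-1]]


def notify(link, recipients):
    """Build one short notification per recipient; only the link travels, not the picture."""
    return [f"To {recipient}: new picture at {link}" for recipient in recipients]


if __name__ == "__main__":
    store = PictureStore()
    photo = b"\x00" * 50_000  # stand-in for image bytes
    link = store.put(photo)
    messages = notify(link, [f"user{i}" for i in range(1000)])
    print(len(messages), "notifications,", sum(map(len, messages)), "characters in total")
```

Each notification carries only a short link, so the cost of reaching a thousand recipients stays far below sending a thousand copies of the picture, which is the point Kahn makes about scarce cellular bandwidth.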
Q2. Do you have any explanation why established companies such as Kodak, Polaroid, and other camera companies (they all had wireless camera projects at that time), could not imagine that the future was digital photography inside the phone?
Philippe Kahn: Yes, I met with all of them. Proposed our solution to no avail. They had an established business and thought that it would never go away and they could wait. They totally missed the paradigm shift. Paradigm shifts are challenges for any established player, look at the demise of Nokia for missing the smartphone.
Q3. What is your take on Citizen journalism?
Philippe Kahn: Citizen journalism is one of the pillars of future democracy. There is always someone snapping and pushing forward a different point of view. We see it every day around the world.
Q4. Do you really believe that people can’t hide things anymore?
Philippe Kahn: I think that people can’t hide what they do in public: Brutality, Generosity, Politics, Emotions. We all have a right to privacy. However in public, there is always someone snapping.
Q5. What about fake news?
Philippe Kahn: There is nothing new about Fake News. It’s always been around. What’s new is that with the web omnipresent, it’s much more effective. Add modern powerful editing and publishing tools and sometimes it’s very challenging to differentiate what’s real from what’s fake.
Q6. You told Bob Parks, who interviewed you for a Wired article in 2000: ‘In the future people will document crimes using video on their phones. Then everyone will know the real story.’ Has this really changed our world?
Philippe Kahn: Yes, it has. It’s forced policing for example to re-examine protocols. Of course not every violence or crime is covered, but video and photos are helping victims.
Q7. What are the challenges and opportunities on a continent like Africa, where many people don’t have laptops, but have phones with cameras?
Philippe Kahn: The opportunities are great. Those countries are skipping the laptop and focusing on a Smartphone with a cloud infrastructure. That’s pretty much what I do daily. In fact, this is what I am doing as I am answering these questions.
Q8. Back to the future: you live now in the world of massive firehoses of machine data and AI-driven algorithms. How will these new technologies change the world (for better or for worse)?
Philippe Kahn: There are always two sides to everything: Even shoes can be used to keep me warm or march fascist armies across illegitimately conquered territories. The dangers of AI lie in police states and in a massive focus on an advertising business model. But what we do with AI is helping us find solutions for better sleep, diabetes, high blood pressure, cancer and more. We need to accept one to get the other in some ways.
Q9. In my recent interview with Vinton G. Cerf, he expressed great concerns about the safety, security and privacy of IoT devices. He told me “A particularly bad scenario would have a hacker taking over the operating system of 100,000 refrigerators.”
Philippe Kahn: When we build AI-powered IoT solutions at Fullpower, security and privacy are paramount. We follow the strictest protocols. Security and privacy are at risk every day with computer viruses and hacking. Nothing is new. It’s always a game of cat and mouse. I want to believe that we are a great cat. We work hard at it.
Q10. With your new startup, Fullpower Technologies, you have developed under-the-mattress sensors and cloud-based artificial intelligence to gather data and personalize recommendations to help customers improve their sleep. What do you think of Cerf’s concerns and how can they be mitigated in practice?
Philippe Kahn: Vint’s concerns are legitimate. At Fullpower our privacy, security and anonymity protocols are our #1 focus together with quality, accuracy, reliability and repeatability. We think of what we build as a fortress. We’ve built in security, privacy, preventive maintenance and automated secure troubleshooting.
Qx. Anything else you wish to add?
Philippe Kahn: There is a lot of hype about the dangers of IoT and AI. It’s important to understand that nobody is building Blade-Runner style replicants. AI is very good at solving specialized challenges: Like being the best at playing chess, where the rules are clear and simple. AI can’t deal with general purpose intelligence that is necessary for a living creature to prosper. We are all using AI, Machine Learning, Deep Learning, Supervised Learning for simple and useful solutions.
Philippe Kahn is CEO of Fullpower, the creative team behind the AI-powered Sleeptracker IoT Smartbed technology platform and the MotionX Wearable Technology platform. Philippe is a mathematician, scientist, inventor, and the creator of the camera phone, whose original 1997 implementation is now with the Smithsonian in Washington, D.C.
– Internet of Things: Safety, Security and Privacy. Interview with Vint G. Cerf, ODBMS Industry Watch, 2017-06-11
– On Artificial Intelligence and Analytics. Interview with Narendra Mulani, ODBMS Industry Watch, 2017-12-08
“We are now seeing a number of our customers in financial services adopt a real-time approach to detecting and preventing fraudulent credit card transactions. With the use of ML integrating into the real-time rules engine within VoltDB, the transaction can be monitored, validated and either rejected or passed, before being completed, saving time and money for both the financial institution and the consumer.”–David Flower.
I have interviewed David Flower, President and Chief Executive Officer of VoltDB. We discussed his strategy for VoltDB, and the main data challenges enterprises face nowadays in performing real-time analytics.
Q1. You joined VoltDB as Chief Revenue Officer last year, and since March 29, 2017 you have been appointed to the role of President and Chief Executive Officer. What is your strategy for VoltDB?
David Flower : When I joined the company we took a step back to really understand our business and move from the start-up phase to growth stage. As with all organizations, you learn from what you have achieved but you also have to be honest about what your value is. We looked at three fundamentals:
1) Success in our customer base – industries, use cases, geography
2) Market dynamics
3) Core product DNA – the underlying strengths of our solution, over and above any other product in the market
The outcome of this exercise is we have moved from a generic veneer market approach to a highly focused specialized business with deep domain knowledge. As with any business, you are looking for repeatability into clearly defined and understood market sectors, and this is the natural next phase in our business evolution and I am very pleased to report that we have made significant progress to date.
With the growing demand for massive data management aligned with real-time decision making, VoltDB is well positioned to take advantage of this opportunity.
Q2. VoltDB is not the only in-memory transactional database in the market. What is your unique selling proposition and how do you position VoltDB in the broader database market?
David Flower : The advantage of operating in the database market is the sheer size and scale that it offers – and that is also the disadvantage. You have to be able to express your target value. Through our customers and the strategic review we undertook, we are now able to express more clearly what value we have and where, and equally importantly, where we do not play! Our USPs revolve around our product principles – vast data ingestion scale, full ACID consistency and the ability to undertake real-time decisioning, all supported through a distributed low-latency in-memory architecture. We also embrace the traditional RDBMS model through SQL to leverage existing market skills and reduce the associated cost of change. We offer a proven enterprise-grade database that is used by some of the world’s leading and most demanding brands, a claim that many other companies in our market are unable to make.
Q3. VoltDB was founded in 2009 by a team of database experts, including Dr. Michael Stonebraker (winner of the ACM Turing award). How much of Stonebraker’s ideas are still in VoltDB and what is new?
David Flower : We are both proud and privileged to be associated with Dr. Stonebraker, and his stature in the database arena is without comparison. Mike’s original ideas underpin our product philosophy and our future direction, and he continues to be actively engaged in the business and will always remain a fundamental part of our heritage. Through our internal engineering experts and in conjunction with our customers, we have developed on Mike’s original ideas to bring additional features, functions and enterprise grade capabilities into the product.
Q4. Stonebraker co-founded several other database companies. Before VoltDB, in 2005, Stonebraker co-founded Vertica to commercialize the technology behind C-Store; and after VoltDB, in 2013 he co-founded another company called Tamr. Is there any relationship between Vertica, VoltDB and Tamr?
David Flower : Mike’s legacy in this field speaks for itself. VoltDB evolved from the Vertica business and while we have no formal ties, we are actively engaged with numerous leading technology companies that enable clients to gain deeper value through close integrations.
Q5. VoltDB is a ground-up redesign of a relational database. What are the main data challenges enterprises face nowadays in performing real-time analytics?
David Flower : The demand for ‘real-time’ is one of the most challenging areas for many businesses today. Firstly, the definition of real-time is changing. Batch or micro-batch processing is now unacceptable – whether that be for the consumer, the customer or, in some cases, for compliance. Secondly, analytics is also moving from the back-end (post event) to the front-end (in-event or in-process).
The drivers around AI and ML are forcing this even more. The market requirement is now for real-time analytics but what is the value of this if you cannot act on it? This is where VoltDB excels – we enable the action on this data, in process, and when the data/time is most valuable. VoltDB is able to truly deliver on the value of translytics – the combination of real-time transactions with real-time analytics, and we can demonstrate this through real use cases.
Q6. VoltDB is specialized in high-velocity applications that thrive on fast streaming data. What is fast streaming data and why does it matter?
David Flower : As previously mentioned, VoltDB is designed for high-volume data streams that require a decision to be taken ‘in-stream’, and it is always consistent. Fast streaming data is best defined through real applications – policy management, authentication and billing in telecoms; fraud detection & prevention in finance (such as massive credit card processing streams); customer engagement offerings in media & gaming; and areas such as smart-metering in IoT.
The underlying principle is that the window of opportunity (action) is available while the data is in the fast data stream, and once that window has passed the value of the opportunity diminishes.
Q7. You have recently announced an “Enterprise Lab Program” to accelerate the impact of real-time data analysis at large enterprise organizations. What is it and how does it work?
David Flower : The objective of the Enterprise Lab Program is to enable organizations to access, test and evaluate our enterprise solution within their own environment and determine the applicability of VoltDB for either the modernization of existing applications or for the support of next gen applications. This comes without restriction, and provides full access to our support, technical consultants and engineering resources. We realize that selecting a database is a major decision and we want to ensure the potential of our product can be fully understood, tested and piloted with access to all our core assets.
Q8. You have been quoted saying that “Fraud is a huge problem on the Internet, and is one of the most scalable cybercrimes on the web today. The only way to negate the impact of fraud is to catch it before a transaction is processed”. Is this really always possible? How do you detect a fraud in practice?
David Flower : With the phenomenal growth in e-commerce and the changing consumer demands for web-driven retailing, the concerns relating to fraud (credit card) are only going to increase. The internet creates the challenge of handling massive transaction volumes, and cyber criminals are becoming ever more sophisticated in their approach.
Traditional fraud models simply were not designed to manage at this scale, and in many cases post-transaction capture is too late – the damage has been done. We are now seeing a number of our customers in financial services adopt a real-time approach to detecting and preventing fraudulent credit card transactions. With the use of ML integrating into the real-time rules engine within VoltDB, the transaction can be monitored, validated and either rejected or passed, before being completed, saving time and money for both the financial institution and the consumer. By using the combination of post-analytics and ML, the most relevant, current and effective set of rules can be applied as the transaction is processed.
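The decision flow described here, scoring a transaction and applying rules before it completes, can be sketched roughly as follows. This is an illustrative Python sketch only and does not use VoltDB or its API; the `Transaction` fields, the `model_score` heuristic (a stand-in for a trained ML model), the rule names and all thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    card_id: str
    amount: float
    country: str       # where the purchase is attempted
    home_country: str  # where the card was issued


def model_score(tx: Transaction) -> float:
    """Stand-in for an ML model: returns a risk score in [0, 1] (hypothetical heuristic)."""
    score = 0.0
    if tx.country != tx.home_country:
        score += 0.5
    if tx.amount > 1000:
        score += 0.4
    return min(score, 1.0)


# Hypothetical rule set; in practice the rules would be refreshed from offline analytics.
RULES = [
    ("amount_limit", lambda tx, score: tx.amount > 5000),
    ("high_risk_score", lambda tx, score: score >= 0.8),
]


def decide(tx: Transaction):
    """Decide before the transaction completes: reject on the first rule that fires."""
    score = model_score(tx)
    for name, predicate in RULES:
        if predicate(tx, score):
            return ("REJECT", name)
    return ("PASS", None)


if __name__ == "__main__":
    print(decide(Transaction("c1", 40.0, "DE", "DE")))
    print(decide(Transaction("c1", 1500.0, "BR", "DE")))
```

Keeping the rules as data rather than hard-coded branches mirrors the point made above: the rule set can be replaced with a more current one without changing the in-transaction decision path.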
Q9. Another area where VoltDB is used is in mobile gaming. What are the main data challenges with mobile gaming platforms?
David Flower : Mobile gaming is a perfect example of fast data – large data streams that require real-time decisioning for in-game customer engagement. The consumer wants the personal interaction but with relevant offers at that precise moment in the game. VoltDB is able to support this demand, at scale and based on the individual’s profile and stage in the application/game. The concept of the right offer, to the right person, at the right time ensures that the user remains loyal to the game and the game developer (company) can maximize its revenue potential through high customer satisfaction levels.
Q11. Can you explain the purpose of VoltDB’s recently announced co-operations with Huawei and Nokia?
David Flower : We have developed close OEM relationships with a number of major global clients, of which Huawei and Nokia are representative. Our aim is to be more than a traditional vendor, and bring additional value to the table, be it in the form of technical innovation, through advanced application development, or in terms of our ‘total company’ support philosophy. We also recognize that infrastructure decisions are critical by nature, and are not made for the short-term.
VoltDB has been rigorously tested by both Huawei and Nokia and was selected for several reasons against some of the world’s leading technologies, but fundamentally because our product works – and works in the most demanding environments providing the capability for existing and next-generation enterprise grade applications.
David Flower brings more than 28 years of experience within the IT industry to the role of President and CEO of VoltDB. David has a track record of building significant shareholder value across multiple software sectors on a global scale through the development and execution of focused strategic plans, organizational development and product leadership.
Before joining VoltDB, David served as Vice President EMEA for Carbon Black Inc. Prior to Carbon Black he held senior executive positions in numerous successful software companies including Senior Vice President International for Everbridge (NASDAQ: EVBG); Vice President EMEA (APM division) for Compuware (formerly NASDAQ: CPWR); and UK Managing Director and Vice President EMEA for Gomez. David also held the position of Group Vice President International for MapInfo Corp. He began his career in senior management roles at Lotus Development Corp and Xerox Corp – Software Division.
David attended Oxford Brookes University where he studied Finance. David retains strong links within the venture capital investment community.
“You can’t get good insights from bad data, and AI is playing an instrumental role in the data preparation renaissance.”–Narendra Mulani
I have interviewed Narendra Mulani, chief analytics officer, Accenture Analytics.
Q1. What is the role of Artificial Intelligence in analytics?
Narendra Mulani: Artificial Intelligence will be the single greatest change driver of our age. Combined with analytics, it’s redefining what’s possible by unlocking new value from data, changing the way we interact with each other and technology, and improving the way we make decisions. It’s giving us wider control and extending our capabilities as businesses and as people.
AI is also the connector and culmination of many elements of our analytics strategy including data, analytics techniques, platforms and differentiated industry skills.
You can’t get good insights from bad data, and AI is playing an instrumental role in the data preparation renaissance.
AI-powered analytics essentially frees talent to focus on insights rather than data preparation, which is more daunting given the sheer volume of data available. It helps organizations tap into new unstructured, contextual data sources like social, video and chat, giving clients a more complete view of their customer. Very recently we acquired Search Technologies, which possesses a unique set of technologies that give ‘context to content’ – whatever its format – and make it quickly accessible to our clients.
As a result, we gain more precise insights on the “why” behind transactions for our clients and can deliver better customer experiences that drive better business outcomes.
Overall, AI-powered analytics will go a long way in allowing the enterprise to find the trapped value that exists in data, discover new opportunities and operate with new agility.
Q2. How can enterprises become ‘data native’ and digital at the core to help them grow and succeed?
Narendra Mulani: It starts with embracing a new culture which we call ‘data native’. You can’t be digital to the core if you don’t embed data at the core. Getting there is no mean feat. The rate of change in technology and data science is exponential, while the rate at which humans can adapt to this change is finite. In order to close the gap, businesses need to democratize data and get new intelligence to the point where it is easily understood and adopted across the organization.
With the help of design-led analytics and app-based delivery, analytics becomes a universal language in the organization, helping employees make data-driven decisions, collaborate across teams and collectively focus efforts on driving improved outcomes for the business.
Enterprises today are only using a small fraction of the data available to them as we have moved from the era of big data to the era of all data. The comprehensive, real-time view businesses can gain of their operations from connected devices is staggering.
But businesses have to get a few things right to ensure they go on this journey.
Understanding and embracing convergence of analytics and artificial intelligence is one of them. You can hardly overstate the impact AI will have on mobilizing and augmenting the value in data, in 2018 and beyond. AI will be the single greatest change driver and will have a lasting effect on how business is conducted.
Enterprises also need to be ready to seize new opportunities – and that means using new data science to help shape hypotheses, test and optimize proofs-of-concept and scale quickly. This will help you reimagine your core business and uncover additional revenue streams and expansion opportunities.
All this requires a new level of agility. To help our clients act and respond fast, we support them with our platforms, our people and our partners. Backed by deep analytics expertise, new cloud-based systems and a curated and powerful alliance and delivery network, our priority is architecting the best solution to meet the needs of each client. We offer an as-a-service engagement model and a suite of intelligent industry solutions that enable even greater agility and speed to market.
Q3. Why is machine learning (ML) such a big deal, where is it driving changes today, and what are the big opportunities for it that have not yet been tapped?
Narendra Mulani: Machine learning allows computers to discover hidden or complex patterns in data without explicit programming. The impact this has on the business is tremendous—it accelerates and augments insights discovery, eliminates tedious repetitive tasks, and essentially enables better outcomes. It can be used to do a lot of good for people, from reading a car’s license plate and forcing the driver to slow down, to allowing people to communicate with others regardless of the language they speak, and helping doctors find very early evidence of cancer.
While the potential we’re seeing for ML and AI in general is vast, businesses are still in the infancy of tapping it. Organizations looking to put AI and ML to use today need to be pragmatic. While it can amplify the quality of insights in many areas, it also increases complexity for organizations, in terms of procuring specialized infrastructure or in identifying and preparing the data to train and use AI, and with validating the results. Identifying the real potential and the challenges involved are areas where most companies today lack the necessary experience and skills and need a trusted advisor or partner.
Whenever we look at the potential AI and ML have, we should also be looking at the responsibility that comes with it. Explainable AI and AI transparency are top of mind for many computer scientists, mathematicians and legal scholars.
These are critical subjects for an ethical application of AI – particularly critical in areas such as financial services, healthcare and life sciences – to ensure that data use is appropriate, and to assess the fairness of derived algorithms.
We need to recognize that, while AI is science, and science is limitless, there are always risks in how that science is used by humans, and proactively identify and address issues this might cause for people and society.
Narendra Mulani is Chief Analytics Officer of Accenture Analytics, a practice that his passion and foresight have helped shape since 2012.
A connector at the core, Narendra brings machine learning, data science, data engineers and the business closer together across industries and geographies to embed analytics and create new intelligence, democratize data and foster a data native culture.
He leads a global team of industry and function-specific analytics professionals, data scientists, data engineers, analytics strategy, design and visualization experts across 56 markets to help clients unlock trapped value and define new ways to disrupt in their markets. As a leader, he believes in creating an environment that is inspiring, exciting and innovative.
Narendra takes a thoughtful approach to developing unique analytics strategies and uncovering impactful outcomes. His insight has been shared with business and trade media including Bloomberg, Harvard Business Review, Information Management, CIO magazine, and CIO Insight. Under Narendra’s leadership, Accenture’s commitment and strong momentum in delivering innovative analytics services to clients was recognized in Everest Group’s Analytics Business Process Services PEAK Matrix™ Assessment in 2016.
Narendra joined Accenture in 1997. Prior to assuming his role as Chief Analytics Officer, he was the Managing Director – Products North America, responsible for delivering innovative solutions to clients across industries including consumer goods and services, pharmaceuticals, and automotive. He was also managing director of supply chain for Accenture Management Consulting where he led a global practice responsible for defining and implementing supply chain capabilities at a diverse set of Fortune 500 clients.
Narendra graduated with a Bachelor of Commerce degree at Bombay University, where he was introduced to statistics and discovered he understood probability at a fundamental level that propelled him on his destined career path. He went on to receive an MBA in Finance in 1982 as well as a PhD in 1985 focused on Multivariate Statistics, both from the University of Massachusetts. Education remains fundamentally important to him.
As one who logs too many frequent flier miles, Narendra is an active proponent of taking time for oneself to recharge and stay at the top of your game. He practices what he preaches through early rising and active mindfulness and meditation to keep his focus and balance at work and at home. Narendra is involved with various activities that support education and the arts, and is a music enthusiast. He lives in Connecticut with his wife Nita and two children, Ravi and Nikhil.