“Orleans is an open-source programming framework for .NET that simplifies the development of distributed applications, that is, ones that run on many servers in a datacenter.”– Phil Bernstein.
I have interviewed, Phil Bernstein,a well known data base researcher and Distinguished Scientist at Microsoft Research, where he has worked for over 20 years. We discussed his latest project “Orleans”.
Q1. With the project “Orleans” you and your team invented the “Virtual Actor abstraction”. What is it?
Phil Bernstein: Orleans is an open-source programming framework for .NET that simplifies the development of distributed applications, that is, ones that run on many servers in a datacenter. In Orleans, objects are actors, by which we mean that they don’t share memory.
In Orleans, actors are virtual in the same sense as virtual memory: an object is activated on demand, i.e. when one of its methods is invoked. If an object is already active when it’s invoked, the Orleans runtime will use its object directory to find the object and invoke it. If the runtime determines that the object isn’t active, the runtime will choose a server on which to activate the object, invoke the object’s constructor on that server to load its state, invoke the method, and update the object directory so it can direct future calls to the object.
Conversely, an object is deactivated when it hasn’t been invoked for some time. In that case, the runtime calls the object’s deactivate method, which does whatever cleanup is needed before freeing up the object’s runtime resources.
Q2. How is it possible to build distributed interactive applications, without the need to learn complex programming patterns?
Phil Bernstein: The virtual actor model hides distribution from the developer. You write code as if your program runs on one machine. The Orleans runtime is responsible for distributing objects across servers, which is something that doesn’t affect the program logic. Of course, there are performance and fault tolerance implications of distribution.
But Orleans is able to hide them too.
Q3. Building interactive services that are scalable and reliable is hard. How do you ensure that Orleans applications scale-up and are reliable?
Phil Bernstein: The biggest impediment to scaling out an app across servers is to ensure no server is a bottleneck. Orleans does this by evenly distributing the objects across servers. This automatically balances the load.
As for reliability, the virtual actor model makes this automatic. If a server fails, then of course all of the objects that were active on that server are gone. No problem. The Orleans runtime detects the server failure and knows which objects were active on the failed server. So the next time any of those objects is invoked, it takes its usual course of action, that is, it chooses a server on which to activate the object, loads the object, and invokes it.
Q4. What about the object’s state? Doesn’t that disappear when its server fails?
Phil Bernstein: Yes, of course all of the object’s main memory state is lost. It’s up to the object’s methods to save object state persistently, typically just before returning from a method that modifies the object’s state.
Q5. Is this transactional?
Phil Bernstein: No, not yet. We’re working on adding a transaction mechanism. Coming soon.
Q6. Can you give us an example of an Orleans application?
Phil Bernstein: Orleans is used for developing large-scale on-line games. For example, all of the cloud services for Halo 4 and Halo 5, the popular Xbox games, run on Orleans. Example object types are players, game consoles, game instances, weapons caches, and leaderboards. Orleans is also used for Internet of Things, communications, and telemetry applications. All of these applications are naturally actor-oriented, so they fit well with the Orleans programming model.
Q7. Why does the traditional three-tier architecture with stateless front-ends, stateless middle tier and a storage layer have limited scalability?
Phil Bernstein: The usual bottleneck is the storage layer. To solve this, developers add a middle tier to cache some state and thereby reduce the storage load. However, this middle tier loses the concurrency control semantics of storage, and now you have the hard problem of distributed cache invalidation. To enforce storage semantics, Orleans makes it trivial to express cached items as objects. And to avoid concurrency control problems, it routes requests to a single instance of each object, which is ordinarily single-threaded.
Also, a middle-tier cache does data shipping to the storage servers, which can be inefficient. With Orleans, you have an object-oriented cache and do function shipping instead.
Q8. How does Orleans differ from other Actor platforms such as Erlang and Akka?
Phil Bernstein: In Erlang and Akka, the developer controls actor lifecycle. You explicitly create an actor and choose the server on which it’s activated. Fixing the actor’s location at creation time prevents automating load balancing, actor migration, and server failure handling. For example, if an actor fails, you need code to catch the exception and resurrect the actor on another server. In Orleans, this is all automatic.
Another difference is the communications model. Orleans uses asynchronous RPC’s. Erlang and Akka use one-way messages.
Q9. Database people sometimes focus exclusively on the data model and query language, and don’t consider the problem of writing a scalable application on top of the database. How is Orleans addressing this issue?
Phil Bernstein: In a database-centric view, an app is a set of stored procedures with a stateless front-end and possibly a middle-tier cache. To scale out the app with this design, you need to partition the database into finer slices every time you want to add servers. By contrast, if your app runs on servers that are separate from the database, as it does with Orleans, you can add servers to scale out the app without scaling out the storage. This is easier, more flexible, and less expensive. For example, you can run with more app servers during the day when there’s heavier usage and fewer servers at night when the workload dies down. This is usually infeasible at the database server layer, since it would require migrating parts of the database twice a day.
Q10. Why did you transfer the core Orleans technology to 343 Industries ?
Phil Bernstein: Orleans was developed in Microsoft Research starting in 2009. Like any research project, after several years of use in production, it was time to move it into a product group, which can better afford the resources to support it. Initially, that was 343 Industries, the biggest Orleans user, which ships the Halo game. After Halo 5 shipped, the Orleans group moved to the parent organization, Microsoft Game Studios, which provides technology to Halo and many other Xbox games.
In Microsoft Research, we are still working on Orleans technology and collaborate closely with the product group. For example, we recently published code to support geo-distributed applications on Orleans, and we’re currently working on adding a transaction mechanism.
Q11. The core Orleans technology was also made available as open source in January 2015. Are developers actively contributing to this?
Phil Bernstein: Yes, there is a lot of activity, with contributions from developers both inside and outside Microsoft. You can see the numbers on GitHub – roughly 25 active contributors and over 25 more occasional contributors – with fully-tested releases published every couple of months. After the core .NET runtime and Roslyn compiler projects, Orleans is the next most popular .NET Foundation project on GitHub.
Phil Bernstein is a Distinguished Scientist at Microsoft Research, where he has worked for over 20 years. Before Microsoft, he was a product architect and researcher at Digital Equipment Corp. and a professor at Harvard University. He has published over 150 papers and two books on the theory and implementation of database systems, especially on transaction processing and data integration, which are still the major areas of his work. He is an ACM Fellow, a winner of the ACM SIGMOD Innovations Award, a member of the Washington State Academy of Sciences and a member of the U.S. National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. from University of Toronto.
On the Industrial Internet of Things. Interview with Leon Guzenda ODBMS Industry Watch, Published on 2016-01-28
Challenges and Opportunities of The Internet of Things. Interview with Steve Cellini ODBMS Industry Watch, Published on 2015-10-07
Follow ODBMS.org on Twitter: @odbmsorg
“Apart from security, the single biggest new challenges that the Industrial Internet of Things poses are the number of devices involved, the rate that many of them can generate data and the database and analytical requirements.” –Leon Guzenda.
I have interviewed Leon Guzenda, Chief Technical Marketing Officer at Objectivity. Topics of the interview are data analytics, the Industrial Internet of Things (IIoT), and ThingSpan.
Q1. What is the difference between Big Data and Fast Data?
Leon Guzenda: Big Data is a generic term for datasets that are too large or complex to be handled with traditional technology. Fast Data refers to streams of data that must be processed or acted upon immediately once received.
If most, or all, of it is stored, it will probably end up as Big Data. Hadoop standardized the parallel processing approach for Big Data, and HDFS provided a resilient storage infrastructure. Meanwhile, Complex Event Processing became the main way of dealing with fast-moving streams of data, applying business logic and triggering event processing. Spark is a major step forward in controlling workflows that have streaming, batch and interactive elements, but it only offers a fairly primitive way to bridge the gap between the Fast and Big Data worlds via tabular RDDs or DataFrames.
ThingSpan, Objectivity’s new information fusion platform, goes beyond that. It integrates with Spark Streaming and HDFS to provide a dynamic Metadata Store that holds information about the many complex relationships between the objects in the Hadoop repository or elsewhere. It can be used to guide data mining using Spark SQL or GraphX and analytics using Spark MLlib.
Q2. Shawn Rogers, Chief Research Officer, Dell Statistica recently said in an interview: “A ‘citizen data scientist’ is an everyday, non-technical user that lacks the statistical and analytical prowess of a traditional data scientist, but is equally eager to leverage data in order to uncovering insights, and importantly, do so at the speed business”. What is your take on this?
Leon Guzenda: It’s a bit like the difference between amateur and professional astronomers.
There are far more data users than trained data scientists, and it’s important that the data users have all of the tools needed to extract value from their data. Things filter down from the professionals to the occasional users. I’ve heard the term “NoHow” applied to tools that make this possible. In other words, the users don’t have to understand the intricacy of the algorithms. They only need to apply them and interpret the results. We’re a long way from that with most kinds of data, but there is a lot of work in this area.
We are making advances in visual analytics, but there is also a large and rapidly growing set of algorithms that the tool builders need to make available. Users should be able to define their data sources, say roughly what they’re looking for and let the tool assemble the workflow and visualizers. We like the idea of “Citizen Data Scientists” being able to extract value from their data more efficiently, but let’s not forget that data blending at the front end is still a challenge and may need some expert help.
That’s another reason why the ThingSpan Metadata Store is important. An expert can describe the data there in terms that are familiar to the user. Applying the wrong analytical algorithm can produce false patterns, particularly when the data has been sampled inadequately. Once again, having an expert constrain those of particular algorithms to certain types of data can make it much more likely that the Citizen Data Scientists will obtain useful results.
Q3. Do we really need the Internet of Things?
Leon Guzenda: That’s a good question. It’s only worth inventing a category if the things that it applies to are sufficiently different from other categories to merit it. If we think of the Internet as a network of connected networks that share the same protocol, then it isn’t necessary to define exactly what each node is. The earliest activities on the Internet were messaging, email and file sharing. The WWW made it possible to set up client-server systems that ran over the Internet. We soon had “push” systems that streamed messages to subscribers rather than having them visit a site and read them. One of the fastest growing uses is the streaming of audio and video. We still haven’t overcome some of the major issues associated with the Internet, notably security, but we’ve come a long way.
Around the turn of the century it became clear that there are real advantages in connecting a wider variety of devices directly to each other in order to improve their effectiveness or an overall system. Separate areas of study, such as smart power grids, cities and homes, each came to the conclusion that new protocols were needed if there were no humans tightly coupled to the loop. Those efforts are now converging to the discipline that we call the Internet of Things (IoT), though you only have to walk the exhibitor hall at any IoT conference to find that we’re at about the same point as we were in the early NoSQL conferences. Some companies have been tackling the problems for many years whilst others are trying to bring value by making it easier to handle connectivity, configuration, security, monitoring, etc.
The Industrial IoT (IIoT) is vital, because it can help improve our quality of life and safety whilst increasing the efficiency of the systems that serve us. The IIoT is a great opportunity for some of the database vendors, such as Objectivity, because we’ve been involved with companies or projects tackling these issues for a couple of decades, notably in telecoms, process control, sensor data fusion, and intelligence analysis. New IoT systems generally need to store data somewhere and make it easy to analyze. That’s what we’re focused on, and why we decided to build ThingSpan, to leverage our existing technology with new open source components to enable real-time relationship and pattern discovery of IIoT applications.
Q4. What is special about the Industrial Internet of Things? And what are the challenges and opportunities in this area?
Leon Guzenda:. Apart from security, the single biggest new challenges that the IIoT poses are the number of devices involved, the rate that many of them can generate data and the database and analytical requirements. The number of humans on the planet is heading towards eight billion, but not all of them have Internet access. The UN expects that there will be around 11 billion of us by 2100. There are likely to be around 25 billion IIoT devices by 2020.
There is growing recognition and desire by organizations to better utilize their sensor-based data to gain competitive advantage. According to McKinsey & Co., organizations in many industry segments are currently using less than 5% of data from their sensors. Better utilization of sensor-based data could lead to a positive impact of up to $11.1 Trillion per year by 2025 through improved productivities.
Q5. Could you give us some examples of predictive maintenance and asset management within the Industrial IoT?
Leon Guzenda: Yes, neither use case is new nor directly the result of the IIoT, but the IIoT makes it easier to collect, aggregate and act upon information gathered from devices. We have customers building telecom, process control and smart building management systems that aggregate information from multiple customers in order to make better predictions about when equipment should be tweaked or maintained.
One of our customers provides systems for conducting seismic surveys for oil and gas companies and for helping them maximize the yield from the resources that they discover. A single borehole can have 10,000 sensors in the equipment at the site.
That’s a lot of data to process in order to maintain control of the operation and avoid problems. Replacing a broken drill bit can take one to three days, with the downtime costing between $1 million and $3.5 million. Predictive maintenance can be used to schedule timely replacement or servicing of the drill bit, reducing the downtime to three hours or so.
There are similar case studies across industries. The CEO of one of the world’s largest package transportation companies said recently that saving a single mile off of every driver’s route resulted in savings of $50 million per year! Airlines also use predictive maintenance to service engines and other aircraft parts to keep passengers safely in the air, and mining companies use GPS tracking beacons on all of their assets to schedule the servicing of vital and very costly equipment optimally. Prevention is much better than treatment when it comes to massive or expensive equipment.
Q6. What is ThingSpan? How is it positioned in the market?
Leon Guzenda: ThingSpan is an information fusion software platform, architected for performance and extensibility, to accelerate time-to-production of IoT applications. ThingSpan is designed to seat between streaming analytics platforms and Big Data platforms in the Fast Data pipeline to create contextual information in the form of transformed data and domain metadata from streaming data and static, historical data. Its main differentiators from other tools in the field are its abilities to handle concurrent high volume ingest and pathfinding query loads.
ThingSpan is built around object-data management technology that is battle-tested in data fusion solutions in production use with U.S. government and Fortune 1000 organizations. It provides out-of-the-box integration with Spark and Hadoop 2.0 as well as other major open source technologies. Objectivity has been bridging the gap between Big Data and Fast Data within the IIoT for leading government agencies and commercial enterprises for decades, in industries such as manufacturing, oil and gas, utilities, logistics and transportation, and telecommunications. Our software is embedded as a key component in several custom IIoT applications, such as management of real-time sensor data, security solutions, and smart grid management.
Q7. Graphs are hard to scale. How do you handle this in ThingSpan?
Leon Guzenda: ThingSpan is based on our scalable, high-performance, distributed object database technology. ThingSpan isn’t constrained to graphs that can be handled in memory, nor is it dependent upon messaging between vertices in the graph. The address space could be easily expanded to the Yottabyte range or beyond, so we don’t expect any scalability issues. The underlying kernel handles difficult tasks, such as pathfinding between nodes, so performance is high and predictable. Supplementing ThingSpan’s database capabilities with the algorithms available via Spark GraphX makes it possible for users to handle a much broader range of tasks.
We’ve also noted over the years that most graphs aren’t as randomly connected as you might expect. We often see clusters of subgraphs, or dandelion-like structures, that we can use to optimize the physical placement of portions of the graph on disk. Having said that, we’ve also done a lot of work to reduce the impact of supernodes (ones with extremely large numbers of connections) and to speed up pathfinding in the cases where physical clustering doesn’t work.
Q8. Could you describe how ThingSpan’s graph capabilities can be beneficial for use cases, such as cybersecurity, fraud detection and anti-money laundering in financial services, to name a few?
Leon Guzenda: Each of those use cases, particularly cybersecurity, deals with fast-moving streams of data, which can be analyzed by checking thresholds in individual pieces of data or accumulated statistics. ThingSpan can be used to correlate the incoming (“Fast”) data that is handled by Spark Streaming with a graph of connections between devices, people or institutions. At that point, you can recognize Denial of Service attacks, fraudulent transactions or money laundering networks, all of which will involve nodes representing suspicious people or organizations.
The faster you can do this, the more chance you have of containing a cybersecurity threat or preventing financial crimes.
Q9. Objectivity has traditionally focused on a relatively narrow range of verticals. How do you intend to support a much broader range of markets than your current base?
Leon Guzenda: Our base has evolved over the years and the number of markets has expanded since the industry’s adoption of Java and widespread acceptance of NoSQL technology. We’ve traditionally maintained a highly focused engineering team and very responsive product support teams at our headquarters and out in the field. We have never attempted to be like Microsoft or Apple, with huge teams of customer service people handling thousands of calls per day. We’ve worked with VARs that embed our products in their equipment or with system integrators that build highly complex systems for their government and industry customers.
We’re expanding this approach with ThingSpan by working with the open source community, as well as building partnerships with technology and service providers. We don’t believe that it’s feasible or necessary to suddenly acquire expertise in a rapidly growing range of disciplines and verticals. We’re happy to hand much of the service work over to partners with the right domain expertise while we focus on strengthening our technologies. We recently announced a technology partnership with Intel via their Trusted Analytics Platform (TAP) initiative. We’ll soon be announcing certification by key technology partners and the completion of major proof of concept ThingSpan projects. Each of us will handle a part of a specific project, supporting our own products or providing expertise and working together to improve our offerings.
Leon Guzenda, Chief Technical Marketing Officer at Objectivity
Leon Guzenda was one of the founding members of Objectivity in 1988 and one of the original architects of Objectivity/DB.
He currently works with Objectivity’s major customers to help them effectively develop and deploy complex applications and systems that use the industry’s highest-performing, most reliable DBMS technology, Objectivity/DB. He also liaises with technology partners and industry groups to help ensure that Objectivity/DB remains at the forefront of database and distributed computing technology.
Leon has more than five decades of experience in the software industry. At Automation Technology Products, he managed the development of the ODBMS for the Cimplex solid modeling and numerical control system.
Before that, he was Principal Project Director for International Computers Ltd. in the United Kingdom, delivering major projects for NATO and leading multinationals. He was also design and development manager for ICL’s 2900 IDMS database product. He spent the first 7 years of his career working in defense and government systems. Leon has a B.S. degree in Electronic Engineering from the University of Wales.
– What is data blending. By Oleg Roderick, David Sanchez, Geisinger Data Science, ODBMS.org, November 2015
-￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼ Industrial Internet of Things: Unleashing the Potential of Connected Products and Services. World Economic Forum. January 2015
– Can Columnar Database Systems Help Mathematical Analytics? by Carlos Ordonez, Department of Computer Science, University of Houston. ODBMS.org, 23 JAN, 2016.
–The Managers Who Stare at Graphs. By Christopher Surdak, JD. ODBMS.org, 23 SEP, 2015.
– From Classical Analytics to Big Data Analytics. by Peter Weidl, IT-Architect, Zürcher Kantonalbank. ODBMS.org,11 AUG, 2015
– Streamlining the Big Data Landscape: Real World Network Security Usecase. By Sonali Parthasarathy Accenture Technology Labs. ODBMS.org, 2 JUL, 2015.
Follow ODBMS.org on Twitter: @odbmsorg
“We have a profound ethical responsibility to design systems that have a positive impact on society, obey the law, and adhere to our highest ethical standards.”–Oren Etzioni.
On the impact of Artificial Intelligence (AI) on society, I have interviewed Oren Etzioni, Chief Executive Officer of the Allen Institute for Artificial Intelligence.
Q1. What is the mission of the Allen Institute for AI (AI2)?
Oren Etzioni: Our mission is to contribute to humanity through high-impact AI research and engineering.
Q2. AI2 is the creation of Paul Allen, Microsoft co-founder, and you are the lead. What role Paul Allen has in AI2, and what is your responsibility?
Oren Etzioni AI2 is based on Paul Allen’s vision, and he leads our Board of Directors and is closely involved in setting our technical agenda. My job is to work closely with Paul and to recruit & lead our team to execute against our ambitious goals.
Q3. Driverless cars, digital Personal Assistants (e.g. Siri), Big Data, the Internet of Things, Robots: Are we on the brink of the next stage of the computer revolution?
Oren Etzioni: Yes, but never mistake a clear view for a short distance—it will take some time.
Q4. Do you believe that AI will transform modern life? How?
Oren Etzioni: Yes, within twenty years—every aspect of human life will transformed. Driving will become a hobby; medicine and science will be transformed by AI Assistants. There will even be robotic sex.
Q5. John Markoff in his book Machines of Loving Grace, reframes a question first raised more than half century ago, when the intelligent machine was born: Will we control these intelligent systems, or will they control us? What is your opinion on this?
Oren Etzioni: It is absolutely essential that we control the machines, and every indication is that we will be able to do so in the foreseeable future. I do worry about human motivations too. Someone said: I‘m not worried about robots deciding to kill people, I’m worried about politicians deciding robots should kill people.
Q6. If we delegate decisions to machines, who will be responsible for the consequences?
Oren Etzioni: Of course we are responsible. That is already true today when we use a car, when we fire a weapon—nothing will change in terms of responsibility. “My robot did it” is not an excuse for anything.
Q7. What are the ethical responsibilities of designers of intelligent systems?
Oren Etzioni: We have a profound ethical responsibility to design systems that have a positive impact on society, obey the law, and adhere to our highest ethical standards.
Q8. What are the current projects at AI2?
Oren Etzioni: We have four primary projects in active development at AI2.
- Aristo: Aristo is a system designed to acquire and store a vast amount of computable knowledge, then apply this knowledge to reason through and answer a variety of science questions from standardized exams for students in multiple grade levels. Aristo leverages machine reading, natural language processing, and diagram interpretation both to expand its knowledge base and to successfully understand exam questions, allowing the system to apply the right knowledge to predict or generate the right answers.
- Semantic Scholar: Semantic Scholar is a powerful tool for searching over large collections of academic papers. S2 leverages our AI expertise in data mining, natural-language processing, and computer vision to help researchers efficiently find relevant information. We can automatically extract the authors, venues, data sets, and figures and graphs from each paper and use this information to generate useful search and discovery experiences. We started with computer science in 2015, and we plan to scale the service to additional scientific areas over the next few years.
- Euclid: Euclid is focused on solving math and geometry problems. Most recently we created GeoS, an end-to-end system that uses a combination of computer vision to interpret diagrams, natural language processing to read and understand text, and a geometric solver to achieve 49 percent accuracy on official SAT test questions. We are continuing to expand and improve upon the different components of GeoS to improve its performance and expand its capabilities.
- Plato: Plato is focused on automatically generating novel knowledge from visual data, including videos, images, and diagrams, and exploring ways to supplement and integrate that knowledge with complementary text data. There are several sub-projects within Plato, including work on predicting the motion dynamics of objects in a given image, the development of a fully automated visual encyclopedia, and a visual knowledge extraction system that can answer questions about proposed relationships between objects or scenes (e.g. “do dogs eat ice cream?”) by using scalable visual verification.
Q9. What research areas are most promising for the next three years at AI2?
Oren Etzioni: We are focused on Natural Language, Machine learning, and Computer Vision.
Oren Etzioni: We have just launched Semantic Scholar —which leverages AI methods to revolutionize the search for computer science papers and articles.
Qx Anything else you with to add?
Dr. Oren Etzioni is Chief Executive Officer of the Allen Institute for Artificial Intelligence.
He has been a Professor at the University of Washington’s Computer Science department since 1991, receiving several awards including GeekWire’s Hire of the Year (2014), Seattle’s Geek of the Year (2013), the Robert Engelmore Memorial Award (2007), the IJCAI Distinguished Paper Award (2005), AAAI Fellow (2003), and a National Young Investigator Award (1993).
He was also the founder or co-founder of several companies including Farecast (sold to Microsoft in 2008) and Decide (sold to eBay in 2013), and the author of over 100 technical papers that have garnered over 23,000 citations.
The goal of Oren’s research is to solve fundamental problems in AI, particularly the automatic learning of knowledge from text. Oren received his Ph.D. from Carnegie Mellon University in 1991, and his B.A. from Harvard in 1986.
MACHINES OF LOVING GRACE– The Quest for Common Ground Between Humans and Robots. By John Markoff.
Illustrated. 378 pp. Ecco/HarperCollins Publishers.
Big Data: A Revolution That Will Transform How We Live, Work and Think. Mayer-Schönberger, V. and Cukier, K. (2013)
– On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch, Published on 2016-01-08
Follow ODBMS.org on Twitter: @odbmsorg.
“There is potentially too much at stake to delegate the issue of control to individuals who are neither aware nor knowledgable enough about how their data is being used to raise alarm bells and sue data processors.”–Viktor Mayer-Schönberger.
On Big Data and Society, I have interviewed Viktor Mayer-Schönberger, Professor of Internet Governance and Regulation at Oxford University (UK).
Happy New Year!
Q1. Is big data changing people’s everyday world in a tangible way?
Viktor Mayer-Schönberger: Yes, of course. Most of us search online regularly. Internet search engines would not work nearly as well without Big Data (and those of us old enough to remember the Yahoo menus of the 1990s know how difficult it was then to find anything online). We would not have recommendation engines helping us find the right product (and thus reducing inefficient transaction costs), nor would flying in a commercial airplane be nearly as safe as it is today.
Q2. You mentioned in your recent book with Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live Work and Think, that the fundamental shift is not in the machines that calculate data but in the data itself and how we use it. But what about people?
Viktor Mayer-Schönberger: I do not think data has agency (in contrast to Latour), so of course humans are driving the development. The point we were making is that the source of value isn’t the huge computing cluster or the smart statistical algorithm, but the data itself. So when for instance asking about the ethics of Big Data it is wrong to focus on the ethics of algorithms, and much more appropriate to focus on the ethics of data use.
Q3. What is more important people`s good intention or good data?
Viktor Mayer-Schönberger: This is a bit like asking whether one prefers apples or sunshine. Good data (being comprehensive and of high quality) reflects reality and thus can help us gain insights into how the world works. That does not make such discovery ethical, even though the discover is correct. Good intentions point towards an ethical use of data, which helps protect us again unethical data uses, but does not prevent false big data analysis. This is a long way of saying we need both, albeit for different reasons.
Q4. What are your suggestion for concrete steps that can be taken to minimize and mitigate big data’s risk?
Viktor Mayer-Schönberger: I have been advocating ex ante risk assessments of big data uses, rather than (as at best we have today) ex post court action. There is potentially too much at stake to delegate the issue of control to individuals who are neither aware nor knowledgable enough about how their data is being used to raise alarm bells and sue data processors. This is not something new. There are many areas of modern life that are so difficult and intransparent for individuals to control that we have delegated control to competent government agencies.
For instance, we don’t test the food in supermarkets ourselves for safety, nor do we crash-test cars before we buy them (or Tv sets, washing machines or microwave ovens), or run our own drug trials.
In all of these cases we put in place stringent regulation that has at its core a suitable process of risk assessment, and a competent agency to enforce it. This is what we need for Big Data as well.
Q5. Do you believe is it possible to ensure transparency, guarantee human freewill, and strike a better balance on privacy and the use of personal information?
Viktor Mayer-Schönberger: Yes, I do believe that. Clearly, today we are getting not enough transparency, and there aren’t sufficiently effective guarantees for free will and privacy in place. So we can do better. And we must.
Q6. You coined in your book the terms “propensity” and “fetishization” of data. What do you mean with these terms?
Viktor Mayer-Schönberger: I don’t think we coined the term “propensity”. It’s an old term denoting the likelihood of something happening. With the “fetishization of data” we meant the temptation (in part caused by our human bias towards causality – understanding the world around us as a sequence of causes and effects) to imbue the results of Big Data analysis with more meaning than they deserve, especially suggesting that they tell us why when they only tell us what.
Q7. Can big and open data be effectively used for the common good?
Viktor Mayer-Schönberger: Of course. Big Data is at its core about understanding the world better than we do today. I would not be in the academy if I did not believe strongly that knowledge is essential for human progress.
Q8. Assuming there is a real potential in using data–driven methods to both help charities develop better services and products, and understand civil society activity. What are the key lessons and recommendations for future work in this space?
Viktor Mayer-Schönberger: My sense is that we need to hope for two developments. First, that more researchers team up with decision makers in charities, and more broadly civil society organizations (and the government) to utilize Big Data to improve our understanding of the key challenges that our society is facing. We need to improve our understanding. Second, we also need decision makers and especially policy makers to better understand the power of Big Data – they need to realize that for their decision making data is their friend; and they need to know that especially here in Europe, the cradle of enlightenment and modern science, data-based rationality is the antidote to dangerous beliefs and ideologies.
Q9. What are your current areas of research?
Viktor Mayer-Schönberger: I have been working on how Big Data is changing learning and the educational system, as well as how Big Data changes the process of discovery, and how this has huge implications, for instance in the medical field.
Viktor Mayer-Schönberger is Professor of Internet Governance and Regulation at Oxford University. In addition to the best-selling “Big Data” (with Kenneth Cukier), Mayer-Schönberger has published eight books, including the awards-winning “Delete: The Virtue of Forgetting in the Digital Age” and is the author of over a hundred articles and book chapters on the information economy. He is a frequent public speaker, and his work have been featured in (among others) New York Times, Wall Street Journal, Financial Times, The Economist, Nature and Science.
Mayer-Schönberger, V. and Cukier, K. (2013) Big Data: A Revolution That Will Transform How We Live, Work and Think. John Murray.
Mayer-Schönberger, V. (2009) Delete – The Virtue of Forgetting in the Digital Age. Princeton University Press.
Follow ODBMS.org on Twitter: @odbmsorg.
“Really, I would say this is indeed the essence of Big Data – being able to harness data from millions of endpoints whether they be devices or users, and optimizing outcomes for the individual, not just for the collective!”–Shilpa Lawande.
I have been following Vertica since their acquisition by HP back in 2011. This is my third interview with Shilpa Lawande, now Vice President at Hewlett Packard Enterprise, and responsible for strategic direction of the HP Big Data Platforms, including HP Vertica Analytic Platform.
The first interview I did with Shilpa was back on November 16, 2011 (soon after the acquisition by HP), and the second on July 14, 2014.
If you read the three interviews (see links to the two previous interviews at the end of this interview), you will notice how fast the Big Data Analytics and Data Platforms world is changing.
Q1. What are the main technical challenges in offering data analytics in real time? And what are the main problems which occur when trying to ingest and analyze high-speed streaming data, from various sources?
Shilpa Lawande: Before we talk about technical challenges, I would like to point out the difference between two classes of analytic workloads that often get grouped under “streaming” or “real-time analytics”.
The first and perhaps more challenging workload deals with analytics at large scale on stored data but where new data may be coming in very fast, in micro-batches.
In this workload, challenges are twofold – the first challenge is about reducing the latency between ingest and analysis, in other words, ensuring that data can be made available for analysis soon after it arrives, and the second challenge is about offering rich, fast analytics on the entire data set, not just the latest batch. This type of workload is a facet of any use case where you want to build reports or predictive models on the most up-to-date data or provide up-to-date personalized analytics for a large number of users, or when collecting and analyzing data from millions of devices. Vertica excels at solving this problem at very large petabyte scale and with very small micro-batches.
The second type of workload deals with analytics on data in flight (sometimes called fast data) where you want to analyze windows of incoming data and take action, perhaps to enrich the data or to discard some of it or to aggregate it, before the data is persisted. An example of this type of workload might be taking data coming in at arbitrary times with granularity and keeping the average, min, and max data points per second, minute, hour for permanent storage. This use case is typically solved by in-memory streaming engines like Storm or, in cases where more state is needed, a NewSQL system like VoltDB, both of which we consider complementary to Vertica.
Q2. Do you know of organizations that already today consume, derive insight from, and act on large volume of data generated from millions of connected devices and applications?
Shilpa Lawande: HP Inc. and Hewlett Packard Enterprise (HPE) are both great examples of this kind of an organization. A number of our products – servers, storage, and printers all collect telemetry about their operations and bring that data back to analyze for purposes of quality control, predictive maintenance, as well as optimized inventory/parts supply chain management.
We’ve also seen organizations collect telemetry across their networks and data centers to anticipate servers going down, as well as to have better understanding of usage to optimize capacity planning or power usage. If you replace devices by users in your question, online and mobile gaming companies, social networks and adtech companies with millions of daily active users all collect clickstream data and use it for creating new and unique personalized experiences. For instance, user churn is a huge problem in monetizing online gaming.
If you can detect, from the in-game interactions, that users are losing interest, then you can immediately take action to hold their attention just a little bit longer or to transition them to a new game altogether. Companies like Game Show Network and Zynga do this masterfully using Vertica real-time analytics!
Really, I would say this is indeed the essence of Big Data – being able to harness data from millions of endpoints whether they be devices or users, and optimizing outcomes for the individual, not just for the collective!
Q3. Could you comment on the strategic decision of HP to enhance its support for Hadoop?
Shilpa Lawande: As you know HP recently split into Hewlett Packard Enterprise (HPE) and HP Inc.
With HPE, which is where Big Data and Vertica resides, our strategy is to provide our customers with the best end-to-end solutions for their big data problems, including hardware, software and services. We believe that technologies Hadoop, Spark, Kafka and R are key tools in the Big Data ecosystem and the deep integration of our technology such as Vertica and these open-source tools enables us to solve our customers’ problems more holistically.
At Vertica, we have been working closely with the Hadoop vendors to provide better integrations between our products.
Some notable, recent additions include our ongoing work with Hortonworks to provide an optimized Vertica SQL-on-Hadoop version for the Orcfile data format, as well as our integration with Apache Kafka.
Q4. The new version of HPE Vertica, “Excavator,” is integrated with Apache Kafka, an open source distributed messaging system for data streaming. Why?
Shilpa Lawande: As I mentioned earlier, one of the challenges with streaming data is ingesting it in micro- batches at low latency and high scale. Vertica has always had the ability to do so due to its unique hybrid load architecture whereby data is ingested into a Write Optimized Store in-memory and then optimized and persisted to a Read-Optimized Store on disk.
Before “Excavator,” the onus for engineering the ingest architecture was on our customers. Before Kafka, users were writing custom ingestion tools from scratch using ODBC/JDBC or staging data to files and then loading using Vertica’s COPY command. Besides the challenges of achieving the optimal load rates, users commonly ran into challenges of ensuring transactionality of the loads, so that each batch gets loaded exactly once even under esoteric error conditions. With Kafka, users get a scalable distributed messaging system that enables simplifying the load pipeline.
We saw the combination of Vertica and Kafka becoming a common design pattern and decided to standardize on this pattern by providing out-of-the-box integration between Vertica and Kafka, incorporating the best practices of loading data at scale. The solution aims to maximize the throughput of loads via micro-batches into Vertica, while ensuring transactionality of the load process. It removes a ton of complexity in the load pipeline from the Vertica users.
Q5.What are the pros and cons of this design choice (if any)?
Shilpa Lawande: The pros are that if you already use Kafka, much of the work of ingesting data into Vertica is done for you. Having seen so many different kinds of ingestion horror stories over the past decade, trust me, we’ve eliminated a ton of complexity that you don’t need to worry about anymore. The cons are, of course, that we are making the choice of the tool for you. We believe that the pros far outweigh any cons.
Q6. What kind of enhanced SQL analytics do you provide?
Shilpa Lawande: Great question. Vertica of course provides all the standard SQL analytic capabilities including joins, aggregations, analytic window functions, and, needless to say, performance that is a lot faster than any other RDBMS. But we do much more than that. We’ve built some unique time-series analysis (via SQL) to operate on event streams such as gap-filling and interpolation and event series joins. You can use this feature to do common operations like sessionization in three or four lines of SQL. We can do this because data in Vertica is always sorted and this makes Vertica a superior system for time series analytics. Our pattern matching capabilities enable user path or marketing funnel analytics using simple SQL, which might otherwise take pages of code in Hive or Java.
With the open source Distributed R engine, we provide predictive analytical algorithms such as logistic regression and page rank. These can be used to build predictive models using R, and the models can be registered into Vertica for in- database scoring. With Excavator, we’ve also added text search capabilities for machine log data, so you can now do both search and analytics over log data in one system. And you recently featured a five-part blog series by Walter Maguire examining why Vertica is the best graph analytics engine out there.
Q7. What kind of enhanced performance to Hadoop do you provide?
Shilpa Lawande We see Hadoop, particularly HDFS, as highly complementary to Vertica. Our users often use HDFS as their data lake, for exploratory/discovery phases of their data lifecycle. Our Vertica SQL on Hadoop offering includes the Vertica engine running natively on Hadoop nodes, providing all the advanced SQL capabilities of Vertica on top of data stored in HDFS. We integrate with native metadata stores like HCatalog and can operate on file formats like Orcfiles, Parquet, JSON, Avro, etc. to provide a much more robust SQL engine compared to the alternatives like Hive, Spark or Impala, and with significantly better performance. And, of course, when users are ready to operationalize the analysis, they can seamlessly load the data into Vertica Enterprise which provides the highest performance, compression, workload management, and other enterprise capabilities for your production workloads. The best part is that you do not have to rewrite your reports or dashboards as you move data from Vertica for SQL on Hadoop to Vertica Enterprise.
Qx Anything else you wish to add?
Shilpa Lawande: As we continue to develop the Vertica product, our goal is to provide the same capabilities in a variety of consumption and deployment models to suit different use cases and buying preferences. Our flagship Vertica Enterprise product can be deployed on-prem, in VMWare environments or in AWS via an AMI.
Our SQL on Hadoop product can be deployed directly in Hadoop environments, supporting all Hadoop distributions and a variety of native data formats. We also have Vertica OnDemand, our data warehouse-as-a-service subscription that is accessible via a SQL prompt in AWS, HPE handles all of the operations such as database and OS software updates, backups, etc. We hope that by providing the same capabilities across many deployment environments and data formats, we provide our users the maximum choice so they can pick the right tool for the job. It’s all based on our signature core analytics engine.
We welcome new users to our growing community to download our Community Edition, which provides 1TB of Vertica on a three-node cluster for free, or sign-up for a 15-day trial of Vertica on Demand!
Shilpa Lawande is Vice President at Hewlett Packard Enterprise, responsible for strategic direction of the HP Big Data Platforms, including the flagship HP Vertica Analytic Platform. Shilpa brings over 20 years of experience in databases, data warehousing, analytics and distributed systems.
She joined Vertica at its inception in 2005, being one of the original engineers who built Vertica from ground up, and running the Vertica Engineering and Customer Experience teams for better part of the last decade. Shilpa has been at HPE since 2011 through the acquisition of Vertica and has held a diverse set of roles spanning technology and business.
Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle Database.
Shilpa is a co-inventor on several patents on database technology, both at Oracle and at HP Vertica.
She has co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing.
She has been named to the 2012 Women to Watch list by Mass High Tech, the Rev Boston 2015 list, and awarded HP Software Business Unit Leader of the year in 2012 and 2013. As a working mom herself, Shilpa is passionate about STEM education for Girls and Women In Tech issues, and co-founded the Datagals women’s networking and advocacy group within HPE. In her spare time, she mentors young women at Year Up Boston, an organization that empowers low-income young adults to go from poverty to professional careers in a single year.
- Uplevel Big Data analytics with HP Vertica – Part 1: Graph in a relational database? Seriously? by Walter Maguire
- Uplevel Big Data Analytics with Graph in Vertica – Part 2: Yes, you can write that in SQL by Walter Maguire
- Uplevel Big Data Analytics with Graph in Vertica – Part 3: Yes, you can make it go even faster by Walter Maguire
- Uplevel Big Data Analytics with Graph in Vertica – Part 4: It’s not your dad’s graph engine by Walter Maggiore
- Uplevel Big Data Analytics with Graph in Vertica – Part 5: Putting graph to work for your business by Walter Maguire
Follow ODBMS.org on Twitter: @odbmsorg
“Topdown cataloging and masterdata management tools typically require expensive data curators, and are not simple to use. This poses a significant threat to cataloging efforts since so much knowledge about your organization’s data is inevitably clustered across the minds of the people who need to question it and the applications they use to answer those questions.”–Gideon Goldin
I have interviewed Gideon Goldin, UX Architect, Product Manager at Tamr.
Q1. What is “dark data”?
Gideon Goldin: Gartner refers to dark data as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” For most organizations, dark data comprises the majority of available data, and it is often the result of the constantly changing and unpredictable nature of enterprise data something that is likely to be exacerbated by corporate restructuring, M&A activity, and a number of external factors.
By shedding light on this data, organizations are better suited to make more datadriven, accurate business decisions.
Tamr Catalog, which is available as a free downloadable app, aims to do this, providing users with a view of their entire data landscape so they can quickly understand what was in the dark and why.
Q2. What are the main drawbacks of traditional topdown methods of cataloging or “master data management”?
Gideon Goldin: The main drawbacks are scalability and simplicity. When Yahoo, for example, started to catalog the web they employed some topdown approaches, hiring specialists to curate structured directories of information. As the web grew, however, their solution became less relevant and significantly more costly. Google, on the other hand, mined the web to understand references that exist between pages, allowing the relevance of sites to emerge from the bottomup. As a result, Google’s search engine was more accurate, easier to scale, and simpler.
Topdown cataloging and masterdata management tools typically require expensive data curators, and are not simple to use. This poses a significant threat to cataloging efforts since so much knowledge about your organization’s data is inevitably clustered across the minds of the people who need to question it and the applications they use to answer those questions. Tamr Catalog aims to deliver an innovative and vastly simplified method for cataloging your organization’s data.
Q3. Tamr recently opened a public Beta program Tamr Catalog for an enterprise metadata catalog. What is it?
Gideon Goldin: The Tamr Catalog Beta Program is an open invitation to testdrive our free cataloging software. We have yet to find an organization that is content with their current cataloging approaches, and we found that the biggest barrier to reform is often knowing where to start. Catalog can help: the goal of the Catalog Beta Program is to better understand how people want and need to collaborate around their data sources. We believe that an early partnership with the community will ensure that we develop useful functionality and thoughtful design.
Q4 What are the core functionality of Tamr Catalog?
Gideon Goldin: Tamr Catalog enables users to easily register, discover and organize their data assets.
Q5. How does it help simplify access to highquality data sets for analytics?
Gideon Goldin: Not surprisingly, people are biased to use the data sets closest to them. With Catalog, scientists and analysts can easily discover unfamiliar data setsdata sets, for example, that may belong to other departments or analysts. Catalog profiles and collects pointers to your sources, providing multifaceted and visual browsing of all data trivializing the search for any given set of data.
Q6. How does Tamr Catalog relate to the Tamr Data Unification Platform?
Gideon Goldin: Before organizations can unify their data, preparing it for improved analysis or management, they need to know what they have. Organizations often lack a good approach for this first (and repeating) step in data unification. We realized this quickly when helping large organizations begin their unification projects, and we even realized we lacked a satisfactory tool to understand our own data. Thus, we built Catalog as a part of the Tamr Data Unification Platform to illuminate your data landscape, such that people can be confident that their unification efforts are as comprehensive as possible.
Q7. What are the main challenges (technical and non technical) in achieving a broad adoption of a vendor and platform neutral metadata cataloging?
Gideon Goldin: Often the challenge isn’t about volume, it’s about variety. While a vendor neutral Catalog intends to solve exactly this, there remains a technical challenge in providing a flexible and elegant interface for cataloging dozens or hundreds of different types of data sets and the structures they comprise.
However, we find that some of the biggest (and most interesting) challenges revolve around organizational processes and culture. Some organizations have developed sophisticated but unsustainable approaches to managing their data, while others have become paralyzed by the inherently disorganized nature of their data. It can be difficult to appreciate the value of investing in these problems. Figuring out where to start, however, shouldn’t be difficult. This is why we chose to release a lightweight application free of charge.
Q8. Chief Data Officers (CDOs), data architects and business analysts have different requirements and different modes of collaborating on (shared) data sets. How do you address this in your catalog?
Gideon Goldin: The goal of cataloging isn’t cataloging, it’s helping CDOs identify business opportunities, empowering architects to improve infrastructures, enabling analysts to enrich their studies, and more. Catalog allows anyone to register and organize sources, encouraging open communication along the way.
Q9. How do you handle issues such as data protection, ownership, provenance and licensing in the Tamr catalog?
Gideon Goldin: Catalog allows users to indicate who owns what. Over the course of our Beta program, we have been fortunate enough to have over 800 early users of Catalog and have collected feedback about how our users would like to see data protection and provenance implemented in their own environments. We are eager to release new functionality to address these needs in the near future.
Q10. Do you plan to use the Tamr Catalog also for collecting data sets that can be used for data projects for the Common Good?
Gideon Goldin: We do know of a few instances of Catalog being used for such purposes, including projects that will build on the documenting of city and health data. In addition to our Catalog Beta Program, we are introducing a Community Developer Program, where we are eager to see how the community links Tamr Catalogs to new sources (including those in other catalogs), new analytics and visualizations, and ultimately insights. We believe in the power of open data at Tamr, and we’re excited to learn how we can help the Common Good.
Gideon Goldin, UX Architect, Product Manager at Tamr.
Prior to Tamr, Gideon Goldin worked as a data visualization/UX consultant and university lecturer. He holds a Masters in HCI and a PhD in cognitive science from Brown University, and is interested in designing novel humanmachine experiences. You can reach Gideon on Twitter at @gideongoldin or email him at Gideon.Goldin at tamr.com.
-Tamr Catalog Developer Community
Online community where Tamr catalog users can comment, interact directly with the development team, and learn more about the software; and where developers can explore extending the tool by creating new data connectors.
Follow ODBMs.org on Twitter: @odbmsorg
“While it’s hard to pinpoint all of the key challenges for organizations hoping to effectively deploy their own predictive models, one significant challenge we’ve observed is the lack of C-level buy in.”–John K. Thompson
Q1. What are the key challenges for organizations to effectively deploy predictive models?
John: While it’s hard to pinpoint all of the key challenges for organizations hoping to effectively deploy their own predictive models, one significant challenge we’ve observed is the lack of C-level buy in. One direct example of this was Dell’s recent internal data migration from a legacy platform to its own platform, Statistica. It required major cultural change, involving identifying key change agents among Dell’s executive and senior management teams, who were responsible for enforcing governance as needed. On a technical level, Dell Statistica contains the most sophisticated algorithms for predictive analytics, machine learning and statistical analysis, enabling companies to find meaningful patterns in data. As 44 percent of organizations still don’t understand how to extract value from their data, revealed in Dell’s Global Technology Adoption Index 2015, Dell helps businesses invest wisely in data technologies, such as Statistica, to leverage the power of predictive analytics.
Q2. What is the role of users in running data analytics?
John: End-users turn to data analytics to better understand their businesses, predict change, increase agility and control critical systems through data. Customers use Statistica for predictive modeling, visualizations, text mining and data mining. With Statistica 13’s NDA capabilities, organizations can save time and resources by allowing the analytic processing to take place in the database or Hadoop cluster, rather than pulling data to a server or desktop. With features such as these, businesses can spend more time analyzing and making decisions from their data vs. processing the information.
Q3. What are the key challenges for organizations to embed analytics across core processes?
John: Embedding analytics across an organization’s core processes helps offer analytics to more users and allows it to become more universally accepted throughout the business. One of the largest challenges of embedding analytics is the attempt to analyze unorganized datasets. This can lead to miscategorization of the data, which can eventually result in making inaccurate business decisions. At Dell’s annual conference, Dell World, on October 20, we announced new offerings and capabilities that enable companies to embed analytics across their core processes and disseminate analytics expertise to give scalability to data-based decision making.
Q4. How is analytics related to the Internet of Things?
John: Data analytics and the Internet of Things go hand in hand. In the modern data economy, the ability to gain predictive insight from all data is critical to building an agile, connected and thriving data-driven enterprise. Whether the data comes from real-time sensors from an IoT environment, or a big data platform designed for analytics on massive amounts of disparate data, our new offerings enable detailed levels of insight and action. With the new capabilities and enhancements delivered in Statistica 13, Dell is making it possible for organizations of all sizes to deploy predictive analytics across the enterprise and beyond in a smart, simple and cost-effective manner. We believe this ultimately empowers them to better understand customers, optimize business processes, and create new products and services.
Q5. On big data and analytics Dell has announced new offerings to its end-to-end big data and analytics portfolio. What are these new offerings?
John: Dell is announcing a series of new big data and analytics solutions and services designed to help companies quickly and securely turn data into insights for better, faster decision-making. Statistica 13, the newest version of our advanced analytics software, makes it easier for organizations to deploy predictive models across the enterprise to reveal business and customer insights. Dell Services’ Analytics-as-a-Service offerings target specific industries, including banking and insurance, to provide actionable information, and better understand customers and business processes. Overall, with these enhancements, Dell is making it easier for organizations to understand how to invest in big data technologies and leverage the power of predictive analytics.
Q6. Dell is not a software company. How do you help customers turn data into insights for better decision making?
John: Dell has made great strides in the software industry, and specifically, the big data and analytics space, since our 2014 acquisition of StatSoft. Both Statistica 13 and Dell’s expanded Analytics-as-a-Service offerings help customers better unearth insights, predict business outcomes, and improve accuracy and efficiency of critical business processes. For example, the new analytics-enabled Business Process Outsourcing (BPO) services help organizations deal with fraud, denial likelihood scoring and customer retention. Additionally, the Dell ModelHealth Tracker to helps customers track and monitor the effectiveness of their various predictive analytics models, leading to better business-decision making at every level.
Q7. What are the main advancements to Dell`s analytics platform that you have introduced? And why?
John: The launch of Statistica 13 helps simplify the way organizations of all sizes deploy predictive models directly to data sources inside the firewall, in the cloud and in partner ecosystems. Additionally, Statistica 13 requires no coding and integrates seamlessly with open source R, which helps organizations leverage all data to predict future trends, identify new customers and sales opportunities, explore “what-if” scenarios, and reduce the occurrence of fraud and other business risks. The full list of enhancements include:
- A modernized GUI for greater ease-of-use and visual appeal
- More integration with the recently added Statistica Interactive Visualization and Dashboard engine
- More integration with open source R allowing for more control of R scripts
- A new stepwise model tool that gradually recommends optimum models for users
- New Native Distributed Analytics (NDA) capabilities that allow users to run analytics directly in the database where data lives and work more efficiently with large and growing data sets
Q8. Why did you introduce a new package of analytics-as-a-service offerings for industry verticals?
John: We’re announcing new analytics-as-a-service offerings in the healthcare and financial industries as those are two areas in which we’re seeing not only extreme growth, but an increased willingness and appetite for leveraging predictive analytics. These new services include:
- Fraud, Waste and Abuse Management:Allows businesses to better identify medical identity theft, unnecessary diagnostic services or medically unnecessary services and incorrect billing.
- Denial Likelihood Scoring and Predictive Analytics:Allows business to proactively identify which claims are most likely to be denied while providing at-a-glance activity data on each account. This can help eliminate up to 40 percent of low- or no-value follow-up work.
- Churn Management/Customer Retention Services:Allows businesses to leverage predictive churn modelling. This helps users identify customers they are at risk of losing and proactively take preventative measures.
Q9. Dell has launched a new purpose-built IoT gateway series with analytics capabilities. What is it and what is it useful for?
John: The new Dell Edge Gateway 5000 Series is a solution designed purpose-built for Industrial IoT. Combined with Statistica, the solution promises to give companies an edge computing solution alternative to today’s costly and proprietary IoT offerings. Thanks to new capabilities in Statistica 13, Dell is now expanding analytics to the gateway, allowing companies to extend the benefits of cloud computing to their network edge. In turn, this allows for more secure business insights, and saves companies the costly transfer of data to and to and from the cloud.
Q10. Anything else you wish to add?
John: If you’d like to hear more about what’s coming from Dell Software at Dell World 2015, check our Twitter feed at @DellSoftware for real-time updates.
John K. Thompson
John K. Thompson is the general manager of global advanced analytics at Dell Software. John has 25 years of experience in building and growing technology companies in the information management segment. He has developed and executed plans for overall sales and marketing, product development and market entry. His focus areas are big data, descriptive & predictive analytics, cognitive computing, and data mining. John holds a BS in Computer Science from Ferris State University and a MBA in Marketing from DePaul University.
– Dell Study Reveals Companies Investing in Cloud, Mobility, Security and Big Data Are Growing More Than 50 Percent Faster Than Laggards, Dell Press release, 13 Oct 2015.
Follow ODBMS.org on Twitter: @odbmsorg
“The question of ‘who owns the data’ will undoubtedly add requirements on the underlying service architecture and database, such as the ability to add meta-data relationships representing the provenance or ownership of specific device data.”–Steve Cellini
I have interviewed Steve Cellini, Vice President of Product Management at NuoDB. We covered the challenges and opportunities of The Internet of Things, seen from the perspective of a database vendor.
Q1. What are in your opinion the main Challenges and Opportunities of The Internet of Things (IoT) seen from the perspective of a database vendor?
Steve Cellini: Great question. With the popularity of Internet of Things, companies have to deal with various requirements, including data confidentiality and authentication, access control within the IoT network, privacy and trust among users and devices, and the enforcement of security and privacy policies. Traditional security counter-measures cannot be directly applied to IoT technologies due to the different standards and communication stacks involved. Moreover, the high number of interconnected devices leads to scalability issues; therefore a flexible infrastructure is needed to be able to deal with security threats in such a dynamic environment.
If you think about IoT from a data perspective, you’d see these characteristics:
• Distributed: lots of data sources, and consumers of workloads over that data are cross-country, cross-region and worldwide.
• Dynamic: data sources come and go, data rates may fluctuate as sets of data are added, dropped or moved into a locality. Workloads may also fluctuate.
• Diverse: data arrives from different kinds of sources
• Immediate: some workloads, such as monitoring, alerting, exception handling require near-real-time access to data for analytics. Especially if you want to spot trends before they become problems, or identify outliers by comparison to current norms or for a real-time dashboard.
These issues represent opportunities for the next generation of databases. For instance, the need for immediacy turns into a strong HTAP (Hybrid Transactional and Analytic Processing) requirement to support that as well as the real-time consumption of the raw data from all the devices.
Q2. Among the key challenge areas for IoT are Security, Trust and Privacy. What is your take on this?
Steve Cellini: IoT scenarios often involve human activities, such as tracking utility usage in a home or recording motion received from security cameras. The data from a single device may be by itself innocuous, but when the data from a variety of devices is combined and integrated, the result may be a fairly complete and revealing view of one’s activities, and may not be anonymous.
With this in mind, the associated data can be thought of as “valuable” or “sensitive” data, with attendant requirements on the underlying database, not dissimilar from, say, the kinds of protections you’d apply to financial data — such as authentication, authorization, logging or encryption.
Additionally, data sovereignty or residency regulations may also require that IoT data for a given class of users be stored in a specific region only, even as workloads that consume that data might be located elsewhere, or may in fact roam in other regions.
There may be other requirements, such as the need to be able to track and audit intermediate handlers of the data, including IoT hubs or gateways, given the increasing trend to closely integrate a device with a specific cloud service provider, which intermediates general access to the device. Also, the question of ‘who owns the data’ will undoubtedly add requirements on the underlying service architecture and database, such as the ability to add meta-data relationships representing the provenance or ownership of specific device data.
Q3. What are the main technical challenges to keep in mind while selecting a database for today’s mobile environment?
Steve Cellini: Mobile users represent sources of data and transactions that move around, imposing additional requirements on the underlying service architecture. One obvious requirement is to enable low-latency access to a fully active, consistent, and up-to-date view of the database, for both mobile apps and their users, and for backend workloads, regardless of where users happen to be located. These two goals may conflict if the underlying database system is locked to a single region, or if it’s replicated and does not support write access in all regions.
It can also get interesting when you take into account the growing body of data sovereignty or residency regulations. Even as your users are traveling globally, how do you ensure that their data-at-rest is being stored in only their home region?
If you can’t achieve these goals without a lot of special-case coding in the application, you are going to have a very complex, error-prone application and service architecture.
Q4. You define NuoDB as a scale-out SQL database for global operations. Could you elaborate on the key features of NuoDB?
Steve Cellini: NuoDB offers several key value propositions to customers: the ability to geo-distribute a single logical database across multiple data centers or regions, arbitrary levels of continuous availability and storage redundancy, elastic horizontal scale out/in on commodity hardware, automation, ease and efficiency of multi-tenancy.
All of these capabilities enable operations to cope flexibly, efficiently and economically as the workload rises and dips around the business lifecycle, or expands with new business requirements.
Q5. What are the typical customer demands that you are responding to?
Steve Cellini: NuoDB is the database for today’s on-demand economy. Businesses have to respond to their customers who demand immediate response and expect a consistent view of their data, whether it be their bank account or e-commerce apps — no matter where they are located. Therefore, businesses are looking to move their key applications to the cloud and ensure data consistency – and that’s what is driving the demand for our geo-distributed SQL database.
Q6. Who needs a geo-distributed database? Could you give some example of relevant use cases?
Steve Cellini: A lot of our customers come to us precisely for our geo distributed capability – by which I mean our ability to run a single unified database spread across multiple locations, accessible for querying and updating equally in all those locations. This is important where applications have mobile users, switching the location they connect to. That happens a lot in the telecommuting industry. Or they’re operating ‘follow the sun’ services where a user might need to access any data from anywhere that’s a pattern with global financial services customers. Or just so they can offer the same low-latency service everywhere. That’s what we call “local everywhere”, which means you don’t see increasing delays, if you are traveling further from the central database.
Q7. You performed recently some tests using the DBT2 Benchmark. Why are you using the DBT2 Benchmark and what are the results you obtained so far?
Steve Cellini: The DBT2 (TPC/C) benchmark is a good test for an operational database, because it simulates a real-world transactional workload.
Our focus on DBT2 hasn’t been on achieving a new record for absolute NOTPM rates, but rather to explore one of our core value propositions — horizontal scale out on commodity hardware. We recently passed the 1 million NOTPM mark on a cluster of 50 low-cost machines and we are very excited about it.
Q8. How is your offering in the area of automation, resiliency, and disaster recovery different (or comparable) with some of the other database competitors?
Steve Cellini: We’ve heard from customers who need to move beyond the complexity, pain and cost of their disaster recovery operations, such as expanding from a typical two data center replication operation to three or more data centers, or addressing lags in updates to the replica, or moving to active/active.
With NuoDB, you use our automation capability to dynamically expand the number of hosts and regions a database operates in, without any interruption of service. You can dial in the level of compute and storage redundancy required and there is no single point of failure in a production NuoDB configuration. And you can update in every location – which may be more than two, if that’s what you need.
Steve Cellini VP, Product Management, NuoDB
Steve joined NuoDB in 2014 and is responsible for Product Management and Product Support, as well as helping with strategic partnerships.
In his 30-year career, he has led software and services programs at various companies – from startups to Fortune 500 – focusing on bringing transformational technology to market. Steve started his career building simulator and user interface systems for electrical and mechanical CAD products and currently holds six patents.
Prior to NuoDB, Steve held senior technical and management positions on cloud, database, and storage projects at EMC, Mozy, and Microsoft. At Microsoft, Steve helped launch one of the first cloud platform services and led a company-wide technical evangelism team. Steve has also built and launched several connected mobile apps. He also managed Services and Engineering groups at two of the first object database companies – Ontos (Ontologic) and Object Design.
Steve holds a Sc.B in Engineering Physics from Cornell University.
Follow ODBMS.org on Twitter: @odbmsorg
“Traditional OLAP tools run into problems when trying to deal with massive data sets and high cardinality.”–Ajay Anand
I have interviewed Ajay Anand, VP Product Management and Marketing, Kyvos Insights. Main topic of the interview is big data analytics.
Q1. In your opinion, what are the current main challenges in obtaining relevant insights from corporate data, both structured and unstructured, regardless of size and granularity?
Ajay Anand: We focus on making big data accessible to the business user, so he/she can explore it and decide what’s relevant. One of the big inhibitors to the adoption of Hadoop is that it is a complex environment and daunting for a business user to work with. Our customers are looking for self-service analytics on data, regardless of the size or granularity. A business user should be able to explore the data without having to write code, look at different aspects of the data, and follow a train of thought to answer a business question, with instant, interactive response times.
Q2. What is your opinion about using SQL on Hadoop?
Ajay Anand: SQL is not the most efficient or intuitive way to explore your data on Hadoop. While Hive, Impala and others have made SQL queries more efficient, it can still take tens of minutes to get a response when you are combining multiple data sets and dealing with billions of rows.
Q3. Kyvos Insights emerged a couple of months ago from Stealth mode. What is your mission?
Ajay Anand: Our mission is to make big data analytics simple, interactive, enjoyable, massively scalable and affordable. It should not be just the domain of the data scientist. A business user should be able to tap into the wealth of information and use it to make better business decisions or wait for reports to be generated.
Q4. There are many diverse tools for big data analytics available today. How do you position your new company in the already quite full market for big data analytics?
Ajay Anand: While there are a number of big data analytics solutions available in the market, most customers we have talked to still had significant pain points. For example, a number of them are Tableau and Excel users. But when they try to connect these tools to large data sets on Hadoop, there is a significant performance impact. We eliminate that performance bottleneck, so that users can continue to use their visualization tool of choice, but now with response time in seconds.
Q5. You offer “cubes on Hadoop.” Could you please explain what are such cubes and what are the useful for?
Ajay Anand: OLAP cubes are not a new concept. In most enterprises, OLAP tools are the preferred way to do fast, interactive analytics.
However, traditional OLAP tools run into problems when trying to deal with massive data sets and high cardinality.
That is where Kyvos comes in. With our “cubes on Hadoop” technology, we can build linearly scalable, multi-dimensional OLAP cubes and store them in a distributed manner on multiple servers in the Hadoop cluster. We have built cubes with hundreds of billions of rows, including dimensions with over 300 million cardinality. Think of a cube where you can include every person in the U.S., and drill down to the granularity of an individual. Once the cube is built, now you can query it with instant response time, either from our front end or from traditional tools such as Excel, Tableau and others.
Q6. How do you convert raw data into insights?
Ajay Anand: We can deal with all kinds of data that has been loaded on Hadoop. Users can browse this data, look at different data sets, combine them and process them with a simple drag and drop interface, with no coding required. They can specify the dimensions and measures they are interested in exploring, and we create Hadoop jobs to process the data and build cubes. Now they can interactively explore the data and get the business insights they are looking for.
Q7. A good analytical process can result in poor results if the data is bad. How do you ensure the quality of data?
Ajay Anand: We provide a simple interface to view your data on Hadoop, decide the rules for dropping bad data, set filters to process the data, combine it with lookup tables and do ETL processing to ensure that the data fits within your parameters of quality. All of this is done without having to write code or SQL queries on Hadoop.
Q8. How do you ensure that the insights you obtained with your tool are relevant?
Ajay Anand: The relevance of the insights really depends on your use case. Hadoop is a flexible and cost-effective environment, so you are not bound by the constraints of an expensive data warehouse where any change is strictly controlled. Here you have the flexibility to change your view, bring in different dimensions and measures and build cubes as you see fit to get the insights you need.
Q9. Why do technical and/or business users want to develop multi-dimensional data models from big data, work with those models interactively in Hadoop, and use slice-and-dice methods? Could you give us some concrete examples?
Ajay Anand: An example of a customer that is using us in production to get insights on customer behavior for marketing campaigns is a media and entertainment company addressing the Latino market. Before using big data, they used to rely on surveys and customer diaries to track viewing behavior. Now they can analyze empirical viewing data from more than 20 million customers, combine it with demographic information, transactional information, geographic information and many other dimensions. Once all of this data has been built into the cube, they can look at different aspects of their customer base with instant response times, and their advertisers can use this to focus marketing campaigns in a much more efficient and targeted manner, and measure the ROI.
Q10. Could you share with us some performance numbers for Kyvos Insights?
Ajay Anand: We are constantly testing our product with increasing data volumes (over 50 TB in one use case) and high cardinality. One telecommunications customer is testing with subscriber information that is expected to grow to several trillion rows of data. We are also testing with industry standard benchmarks such as TPC-DS and the Star Schema Benchmark. We find that we are getting response times of under two seconds for queries where Impala and Hive take multiple minutes.
Q11. Anything else you wish to add?
Ajay Anand: As big data adoption enters the mainstream, we are finding that customers are demanding that analytics in this environment be simple, responsive and interactive. It must be usable by a business person who is looking for insights to aid his/her decisions without having to wait for hours for a report to run, or be dependent on an expert who can write map-reduce jobs or Hive queries. We are moving to a truly democratized environment for big data analytics, and that’s where we have focused our efforts with Kyvos.
Ajay Anand is vice president of products and marketing at Kyvos Insights, delivering multi-dimensional OLAP solutions that run natively on Hadoop. Ajay has more than 20 years of experience in marketing, product management and development in the areas of big data analytics, storage and high availability clustered systems.
Prior to Kyvos Insights, he was founder and vice president of products at Datameer, delivering the first commercial analytics product on Hadoop. Before that he was director of product management at Yahoo, driving adoption of the Hadoop based data analytics infrastructure across all Yahoo properties. Previously, Ajay was director of product management and marketing for SGI’s Storage Division. Ajay has also held a number of marketing and product management roles at Sun, managing teams and products in the areas of high availability clustered systems, systems management and middleware.
Ajay earned an M.B.A. and an M.S. in computer engineering from the University of Texas at Austin, and a BSEE from the Indian Institute of Technology.
Follow ODBMS.org on Twitter: @odbmsorg
“Many companies don’t use analytical fraud detection techniques yet. In fact, most still rely on an expert based approach, meaning that they build upon the experience, intuition and business knowledge of the fraud analyst.” –Bart Baesens
Q1. What is exactly Fraud Analytics?
Good question! First of all, in our book we define fraud as an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types of forms. The idea of using analytics for fraud detection is catalyzed by the enormous amount of data which is currently being generated in any business process. Think about insurance claim handling, credit card transactions, cash transfers, tax payments, etc. to name a few. In our book, we discuss various ways of analyzing these massive data sets in a descriptive, predictive or social network way to come up with new analytical fraud detection models.
Q2. What are the main challenges in Fraud Analytics?
The definition we gave above highlights the 5 key challenges in fraud analytics. The first one concerns the fact that fraud is uncommon. Independent of the exact setting or application, only a minority of the involved population of cases typically concerns fraud, of which furthermore only a limited number will be known to concern fraud. This seriously complicates the estimation of analytical models.
Fraudsters try to blend into the environment and not behave different from others in order not to get noticed and to remain covered by non-fraudsters. This effectively makes fraud imperceptibly concealed, since fraudsters do succeed in hiding by well considering and planning how to precisely commit fraud.
Fraud detection systems improve and learn by example. Therefore the techniques and tricks fraudsters adopt evolve in time along with, or better ahead of fraud detection mechanisms. This cat and mouse play between fraudsters and fraud fighters may seem to be an endless game, yet there is no alternative solution so far. By adopting and developing advanced analytical fraud detection and prevention mechanisms, organizations do manage to reduce losses due to fraud since fraudsters, like other criminals, tend to look for the easy way and will look for other, easier opportunities.
Fraud is typically a carefully organized crime, meaning that fraudsters often do not operate independently, have allies, and may induce copycats. Moreover, several fraud types such as money laundering and carousel fraud involve complex structures that are set up in order to commit fraud in an organized manner. This makes fraud not to be an isolated event, and as such in order to detect fraud the context (e.g., the social network of fraudsters) should be taken into account. This is also extensively discussed in our book.
A final element in the description of fraud provided in our book indicates the many different types of forms in which fraud occurs. This both refers to the wide set of techniques and approaches used by fraudsters as well as to the many different settings in which fraud occurs or economic activities that are susceptible to fraud.
Q3. What is the current state of the art in ensuring early detection in order to mitigate fraud damage?
Many companies don’t use analytical fraud detection techniques yet. In fact, most still rely on an expert based approach, meaning that they build upon the experience, intuition and business knowledge of the fraud analyst. Such an expert-based approach typically involves a manual investigation of a suspicious case, which may have been signaled for instance by a customer complaining of being charged for transactions he did not do. Such a disputed transaction may indicate a new fraud mechanism to have been discovered or developed by fraudsters, and therefore requires a detailed investigation for the organization to understand and subsequently address the new mechanism.
Comprehension of the fraud mechanism or pattern allows extending the fraud detection and prevention mechanism which is often implemented as a rule base or engine, meaning in the form of a set of IF-THEN rules, by adding rules that describe the newly detected fraud mechanism. These rules, together with rules describing previously detected fraud patterns, are applied to future cases or transactions and trigger an alert or signal when fraud is or may be committed by use of this mechanism. A simple, yet possibly very effective example of a fraud detection rule in an insurance claim fraud setting goes as follows:
- Amount of claim is above threshold OR
- Severe accident, but no police report OR
- Severe injury, but no doctor report OR
- Claimant has multiple versions of the accident OR
- Multiple receipts submitted
- Flag claim as suspicious AND
- Alert fraud investigation officer
Such an expert approach suffers from a number of disadvantages. Rule bases or engines are typically expensive to build, since requiring advanced manual input by the fraud experts, and often turn out to be difficult to maintain and manage. Rules have to be kept up to date and only or mostly trigger real fraudulent cases, since every signaled case requires human follow-up and investigation. Therefore the main challenge concerns keeping the rule base lean and effective, in other words deciding upon when and which rules to add, remove, update, or merge.
By using data-driven analytical models such as descriptive, predictive or social network analytics in a complimentary way, we can improve the performance of our fraud detection approaches in terms of precision, cost efficiency and operational effectiveness.
Q4. Is early detection all that can be done? Are there any other advanced techniques that can be used?
You can do more than just detection. More specifically, two components that are essential parts of almost any effective strategy to fight fraud concern fraud detection and fraud prevention. Fraud detection refers to the ability to recognize or discover fraudulent activities, whereas fraud prevention refers to measures that can be taken aiming to avoid or reduce fraud. The difference between both is clear-cut, the former is an ex post approach whereas the latter an ex ante approach. Both tools may and likely should be used in a complementary manner to pursue the shared objective, being fraud reduction. However, as also discussed in our book, preventive actions will change fraud strategies and consequently impact detection power. Installing a detection system will cause fraudsters to adapt and change their behavior, and so the detection system itself will impair eventually its own detection power. So although complementary, fraud detection and prevention are not independent and therefore should be aligned and considered a whole.
Q5. How do you examine fraud patterns in historical data?
You can examine it in two possible ways: descriptive or predictive. Descriptive analytics or unsupervised learning aims at finding unusual anomalous behavior deviating from the average behavior or norm. This norm can be defined in various ways. It can be defined as the behavior of the average customer at a snapshot in time, or as the average behavior of a given customer across a particular time period, or as a combination of both. Predictive analytics or supervised learning assumes the availability of a historical data set with known fraudulent transactions. The analytical models built can thus only detect fraud patterns as they occurred in the past. Consequently, it will be impossible to detect previously unknown fraud. Predictive analytics can however also be useful to help explain the anomalies found by descriptive analytics.
Q6. How do you typically utilize labeled, unlabeled, and networked data for fraud detection?
Labeled observations or transactions can be analyzed using predictive analytics. Popular techniques here are linear/logistic regression, neural networks and ensemble methods such as random forests. These techniques can be used to predict both fraud incidence, which is a classification problem, as well as fraud intensity, which is a classical regression problem. Unlabeled data can be investigated using descriptive analytics. As said, the aim here is to detect anomalies deviating from the norm. Popular techniques here are: break point analysis, peer group analysis, association rules and clustering. Networked data can be analyzed using social network techniques. We found those to be very useful in our research. Popular techniques here are community detection and featurization. In our research, we developed GOTCHA!, a supervised social network learner for fraud detection. This is also extensively discussed in our book.
Q6. Fraud techniques change over time. How do you handle this?
Good point! A key challenge concerns the dynamic nature of fraud. Fraudsters try to constantly out beat detection and prevention systems by developing new strategies and methods. Therefore adaptive analytical models and detection and prevention systems are required, in order to detect and resolve fraud as soon as possible. Detecting fraud as early as possible is crucial. Hence, we also discuss how to continuously backtest analytical fraud detection models. The key idea here is to verify whether the fraud model still performs satisfactory. Changing fraud tactics creates concept drift implying that the relationship between the target fraud indicator and the data available changes on an on-going basis. Hence, it is important to closely follow-up the performance of the analytical model such that concept drift and any related performance deviation can be detected in a timely way. Depending upon the type of model and its purpose (e.g. descriptive or predictive), various backtesting activities can be undertaken. Examples are backtesting data stability, model stability and model calibration.
Q7. What are the synergies between Fraud Analytics and CyberSecurity?
Fraud analytics creates both opportunities as well as threats for cybersecurity. Think about intrusion detection as an example Predictive methods can be adopted to study known intrusion patterns, whereas descriptive methods or anomaly detection can identify emerging cyber threats. The emergence of the Internet of Things (IoT) will certainly exacerbate the importance of fraud analytics for cybersecurity. Some examples of new fraud treats are:
- Fraudsters might force access to web configurable devices (e.g. Automated Teller Machines (ATMs)) and set up fraudulent transactions;
- Device hacking whereby fraudsters change operational parameters of connected devices (e.g. smart meters are manipulated to make them under register actual usage);
- Denial of Service (DoS) attacks whereby fraudsters massively attack a connected device to stop it from functioning;
- Data breach whereby a user’s log in information is obtained in a malicious way resulting into identity theft;
- Gadget fraud also referred to as gadget lust whereby fraudsters file fraudulent claims to either obtain a new gadget or free upgrade;
- Cyber espionage whereby exchanged data is eavesdropped by an intelligence agency or used by a company for commercial purposes.
More than ever before, fraud will be dynamic and continuously changing in an IoT context. From an analytical perspective, this implies that predictive techniques will continuously lag behind since they are based on a historical data set with known fraud patterns. Hence, as soon as the predictive model has been estimated, it will become outdated even before it has been put into production. Descriptive methods such as anomaly detection, peer group and break point analysis will gain in importance. These methods should be capable of analyzing evolving data streams and perform incremental learning to deal with concept drift. To facilitate (near) real-time fraud detection, the data and algorithms should be processed in-memory instead of relying on slow secondary storage. Furthermore, based upon the results of these analytical models, it should be possible to take fully automated actions such as the shutdown of a smart meter or ATM.
Qx Anything else you wish to add?
We are happy to refer to our book for more information. We also value your opinion and look forward to receiving any feedback (both positive and negative)!
Professor Bart Baesens is a professor at KU Leuven (Belgium), and a lecturer at the University of Southampton (United Kingdom). He has done extensive research on big data & analytics, customer relationship management, web analytics, fraud detection, and credit risk management. His findings have been published in well-known international journals and presented at international top conferences. He is also author of the books Analytics in a Big Data World (goo.gl/k3kBrB), and Fraud Analytics using Descriptive, Predictive and Social Network Techniques (http://goo.gl/nlCjUr). His research is summarised at www.dataminingapps.com. He is also teaching the E-learning course, Advanced Analytics in a Big Data World, see http://goo.gl/WibNPF. He also regularly tutors, advises and provides consulting support to international firms with respect to their analytics and credit risk management strategy.
–Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection (Wiley and SAS Business Series). Authors: Bart Baesens ,Veronique Van Vlasselaer,Wouter Verbeke.
Series: Wiley and SAS Business Series, Hardcover: 400 pages. Publisher: Wiley; 1 edition, September 2015. ISBN-10: 1119133122
–Fraud Analytics:Using Supervised, Unsupervised and Social Network Learning Techniques. Authors: Bart Baesens, Véronique Van Vlasselaer, Wouter Verbeke
Publisher: Wiley 256 pages
ISBN-13: 978-1119133124 | ISBN-10: 1119133122
– Critical Success Factors for Analytical Models: Some Recent Research Insights. Bart Baesens, ODBMS.org,
27 APR, 2015
– Analytics in a Big Data World: The Essential Guide to Data Science and its Applications. Bart Baesens, ODBMS.org, 30 APR, 2014
–The threat from AI is real, but everyone has it wrong, Robert Munro, CEO Idibon. ODBMS.org
Follow ODBMS.org on Twitter: @odbmsorg