“I’ve managed several employees who have successfully transitioned from an operations role to an analytics role. In fact, some of them have become my best analysts because they have brought a deeper domain knowledge to their analyses than someone approaching from the outside may have done. “–Rob Winters
I have interviewed Rob Winters,Head of Business Intelligence at TravelBird. The interview covers Rob`s projects experience with data analytics and HPE Vertica.
Q1. What is the business of TravelBird?
Rob Winters: TravelBird builds and provides a daily selection of inspirational holiday offerings in twelve markets across Europe. Our goal is to create packages which excite the imagination and bring simplicity and joy to the act of travelling. These packages are then shared with our travellers via email, our website, and our iOS and Android applications.
Q2. What are the current data projects at TravelBird?
Rob Winters: TravelBird’s journey with being data driven is relatively short, beginning our initial Business Intelligence buildout in mid-2015. Currently our BI team is engaged in a number of projects, both more traditional BI and advanced analytics, including:
– Building data sources and training an organization in self-service BI
– Replacing our generic daily selections with personalized content selection models
– Optimizing pricing of packages based on product price volatility and customer demand
– Adjusting email frequency and timing to improve customer engagement and lifetime value
Q3. What is your experience in using predictive analytics?
Rob Winters: I have been working in the predictive analytics field for six years now across a variety of problem areas – customer service, retail, gaming, and now travel. From a technology standpoint I originally worked heavily with commercial solutions (Teradata, SAS) but for the last four years have used almost exclusively open source software including Hadoop, Spark, R, and Python.
Q4. How do you evaluate if your discovering insights are “good”?
Rob Winters: During the initial development of our algorithms we will typically follow a basic version of CRISP-DM to build an initial working model for our problem. To test models, we always use an A/B test and typically follow a two phase process: first the model is split-test against the current operational process/human selection, then when the model consistently outperforms the status quo, we will test future model iterations against the control.
Q5. Can you tell us a bit about the work you did in designing and implementing a fully automated, machine learning based content selection platform?
Rob Winters: To provide context, every day our planning team creates six unique product offerings for their target market of 50-500k customers to be shared via web, iOS/Android app, and email. Our goal was to replace that model with one that selects six unique products for each recipient based on past browsing and travel behavior. To do so, we designed an ensemble model consisting of several components:
– A customer preference model (user-item recommendation model)
– A product similarity model (item-item similarity)
– A “hotness” model to promote destinations which are trending/outperforming/expected to do well
– A portfolio model to select the right diversity for each recipient based on recommendation confidence, lifecycle state, and yield optimization of cannibalization vs product fit for a recipient
The data to feed these models is based on observing dozens of events per recipient per day, positive and negative feedback events of the recipient, all observable product features, and human expert input. The models are also able to improve themselves by continuously tuning the input parameters of each model based on recommendation split testing.
Q6. What are the primary technologies you are using?
Rob Winters: Our technology stack consists of the following:
-Data warehousing: HPE Vertica
-Operations DBs: MySQL (web services) + Postgres (internal services)
-Recommendations serving: Redis
-Modeling/Analysis: Python, Spark via PySpark
Q7. What is your experience in using HPE Vertica?
Rob Winters: I have been using Vertica for five years in a number of organizations and facilitated the first rollout in the Netherlands. During that time I have been primarily an end user/data analyst but have also been the DBA for my deployments for the last two years.
Q8: Can you give us some more technical details of what was this first rollout in the Netherlands? What challenges did you solve in using HPE Vertica? What business benefits did you obtain?
Rob Winters: The objective of our rollout was to implement a centralized company datawarehouse to unify several production databases plus external API data.
The existing platform was Postgres (row-based solution) and relatively limited in performance. Primary gains were significantly faster analytics, the ability to add in several terabytes of event data (which was not possible on the prior platform), and new insights into the email database regarding churn, conversion, and customer value.
Q9: What were the main criteria for you to choose HPE Vertica? Did you do any performance test for HPE Vertica?
Rob Winters: We considered a number of alternatives including Microsoft PDW, Greenplum, and Infobright.
The primary considerations were price/performance, scalability, and analytical functionality. We found Vertica to be the best options across those aspects. Regarding performance testing, we did compare Infobright and Vertica and found the latter to be both more performant and easier to work with.
Q10. What specific functionalities of HPE Vertica do you find particularly useful in your job?
Rob Winters: There are a number of aspects which I find extremely beneficial, including:
-Ease of administration
-Performance tunability is very good, much higher than (for example) Redshift
-Analytical function extensions enable extremely powerful analyses directly via SQL
-The ability to load JSON data allows very rapid data integration from new sources
Q11. Do you think is it possible to turn an employee into a data analyst?
Rob Winters: Absolutely, I’ve managed several employees who have successfully transitioned from an operations role to an analytics role. In fact, some of them have become my best analysts because they have brought a deeper domain knowledge to their analyses than someone approaching from the outside may have done. The biggest drivers for success in the transitition have been:
– Attitude/eagerness to learn
– Close collaboration with a more experienced analyst, either their supervisor or a more senior peer
– Making their initial projects in areas where they are unable to fall back on domain knowledge
Rob Winters, Head of Business Intelligence at TravelBird.
Rob has been working with and leading analytics teams since 2006 across a number of industries including telco, gaming, retail, and travel. His primary focus since 2011 has been green-field implementations of technology and team creation for both traditional business intelligence and predictive analytics; full details are listed on my linkedin profile. He holds a bachelor’s in economics and an MBA with a IT concentration.
– Data-X: Video lectures on very practical and applied Data Analytics. Data-X is a project to produce a collection of video lectures on very practical and applied data analytics.
Follow us on Twitter: @odbmsorg
“I think we’re just beginning to grapple with implications of data as an economic asset” –Steve Lohr.
My last interview for this year is with Steve Lohr. Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting. We discussed Big Data and how it influences the new Artificial Intelligence awakening.
Wishing you all the best for the Holiday Season and a healthy and prosperous New Year!
Steve Lohr: Both Google and Microsoft are contributing their tools to expand and enlarge the AI community, which is good for the world and good for their businesses. But I also think the move is a recognition that algorithms are not where their long-term advantage lies. Data is.
Q2. What are the implications of that for both business and policy?
Steve Lohr: The companies with big data pools can have great economic power. Today, that shortlist would include Google, Microsoft, Facebook, Amazon, Apple and Baidu.
I think we’re just beginning to grapple with implications of data as an economic asset. For example, you’re seeing that now with Microsoft’s plan to buy LinkedIn, with its personal profiles and professional connections for more than 400 million people. In the evolving data economy, is that an antitrust issue of concern?
Q3. In this competing world of AI, what is more important, vast data pools, sophisticated algorithms or deep pockets?
Steve Lohr: The best answer to that question, I think, came from a recent conversation with Andrew Ng, a Stanford professor who worked at GoogleX, is co-founder of Coursera and is now chief scientist at Baidu. I asked him why Baidu, and he replied there were only a few places to go to be a leader in A.I. Superior software algorithms, he explained, may give you an advantage for months, but probably no more. Instead, Ng said, you look for companies with two things — lots of capital and lots of data. “No one can replicate your data,” he said. “It’s the defensible barrier, not algorithms.”
Q4. What is the interplay and implications of big data and artificial intelligence?
Steve Lohr: The data revolution has made the recent AI advances possible. We’ve seen big improvements in the last few years, for example, in AI tasks like speech recognition and image recognition, using neural network and deep learning techniques. Those technologies have been around for decades, but they are getting a huge boost from the abundance of training data because of all the web image and voice data that can be tapped now.
Q5. Is data science really only a here-and-now version of AI?
Steve Lohr: No, certainly not only. But I do find that phrase a useful way to explain to most of my readers — intelligent people, but not computer scientists — the interplay between data science and AI. To convey that rudiments of data-driven AI are already all around us. It’s not — surely not yet — robot armies and self-driving cars as fixtures of everyday life. But it is internet search, product recommendations, targeted advertising and elements of personalized medicine, to cite a few examples.
Q6. Technology is moving beyond increasing the odds of making a sale, to being used in higher-stakes decisions like medical diagnosis, loan approvals, hiring and crime prevention. What are the societal implications of this?
Steve Lohr: The new, higher-stakes decisions that data science and AI tools are increasingly being used to make — or assist in making — are fundamentally different than marketing and advertising. In marketing and advertising, a decision that is better on average is plenty good enough. You’ve increased sales and made more money. You don’t really have to know why.
But the other decisions you mentioned are practically and ethically very different. These are crucial decisions about individual people’s lives. Better on average isn’t good enough. For these kinds of decisions, issues of accuracy, fairness and discrimination come into play.
That, I think, argues for two things. First, some sort of auditing tool; the technology has to be able to explain itself, to explain how a data-driven algorithm came to the decision or recommendation that it did.
Second, I think it argues for having a “human in the loop” for most of these kinds of decisions for the foreseeable future.
Q7. Will data analytics move into the mainstream of the economy (far beyond the well known, born-on-the-internet success stories like Google, Facebook and Amazon)?
Steve Lohr: Yes, and I think we’re seeing that now in nearly every field — health care, agriculture, transportation, energy and others. That said, it is still very early. It is a phenomenon that will play out for years, and decades.
Recently, I talked to Jeffrey Immelt, the chief executive of General Electric, America’s largest industrial company. GE is investing heavily to put data-generating sensors on its jet engines, power turbines, medical equipment and other machines — and to hire software engineers and data scientists.
Immelt said if you go back more than a century to the origins of the company, dating back to Thomas Edison‘s days, GE’s technical foundation has been materials science and physics. Data analytics, he said, will be the third fundamental technology for GE in the future.
I think that’s a pretty telling sign of where things are headed.
Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years and writes for the Times’ Bits blog. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting.
He was a foreign correspondent for a decade and served as an editor, and has written for national publications such as the New York Times Magazine, the Atlantic, and the Washington Monthly. He is the author of Go To: The Story of the Math Majors, Bridge Players, Engineers, Chess Wizards, Maverick Scientists, Iconoclasts—the Programmers Who Created the Software Revolution and Data-ism The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else.
He lives in New York City.
Follow us on Twitter:@odbmsorg
“It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.”–Nick Heudecker.
I have interviewed Nick Heudecker, Research Director on Gartner’s Data & Analytics team.
The main topic of the interview is the new Magic Quadrant for Operational Database Management Systems.
Q1. You have published the new Magic Quadrant for Operational Database Management Systems (*). How do you define the operational database management system market?
Nick Heudecker: We define a DBMS as a complete software system used to define, create, manage, update and query a database. DBMSs provide interfaces to independent programs and tools that both support and govern the performance of a variety of concurrent workload types. There is no presupposition that DBMSs must support the relational model or that they must support the full set of possible data types in use today. OPDBMSs must include functionality to support backup and recovery, and have some form of transaction durability — although the atomicity, consistency, isolation and durability model is not a requirement. OPDBMSs may support multiple delivery models, such as stand-alone DBMS software, certified configurations, cloud (public and private) images or versions, and database appliances.
Q2. Can you explain the methodology you used for this new Magic Quadrant?
Nick Heudecker: The methodologies for several Gartner methodologies are public. The Magic Quadrant methodology can be found here.
We use a number of data sources when we’re creating the Magic Quadrant for Operational Database Management Systems.
We survey vendor reference customers and include data from our interactions with Gartner clients. We also consider earlier information and any news about vendors’ products, customers and finances that came to light during the time frame for our analysis.
Once we have the data, we score vendors across the various dimensions of Completeness of Vision and Ability to Execute.
One thing that’s important to note is Magic Quadrants are relative assessments of vendors in a market. We couldn’t have one vendor on an MQ because it would be right in the middle – there’s nothing to compare it to.
Q3. Why were there no Visionaries this year?
Nick Heudecker: We determined there was an overall lack of vision in the market. After a few years of rapid feature expansion, the focus has shifted to operational excellence and execution. Even Leaders shifted to the left on vision, but are still placed in the Leaders quadrant based on their vision for the development of hybrid database management, hardware optimization and integration, emerging deployment models such as containerization, as well as vertical features.
Q4. Were you surprised by the analysis and some of the results you obtained?
Nick Heudecker: The lack of overall vision in the market struck us the most. Other than in a few notable cases, we received largely the same story from most vendors. The explosion of features, and the vendors emerging to implement them, has slowed. The features that initiated the expansion, such as storing new data types, geographically distributed storage, cloud and flexible data consistency models, have become common. Today, nearly every established or emerging DBMS vendor supports these features to some degree. The OPDBMS market has shifted from a phase of rapid innovation to a phase of maturing products and capabilities.
Q5. Do you believe the “NoSQL” label will continue to distinguish DBMSs?
Nick Heudecker: If you look at the entire operational DBMS space, there’s already a great deal of convergence between NoSQL vendors, as well as between NoSQL and traditionally relational vendors. Nearly every vendor, nonrelational and relational, supports multiple data types, like JSON documents, graph or wide-column. NoSQL vendors are adding SQL: MongoDB’s BI Connector and Couchbase’s N1QL are good, if diverse, examples. They’re also adding things like schema management and data validation capabilities.
On the relational side, they’re adding horizontal scaling options and alternative consistency models, as well as modern APIs. And everyone either has or is adding in-memory and cloud capabilities.
It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.
Q6. What are the other “Vendors to Consider”?
Nick Heudecker: The other vendors to consider are vendors that did not meet the inclusion requirements for the Magic Quadrant. Usually this is because they missed our minimum revenue requirements, but that doesn’t mean they don’t have compelling products.
Nick Heudecker is a Research Director on Gartner’s Data & Analytics team. His coverage includes data management technologies and practices.
(*) Magic Quadrant for Operational Database Management Systems. Published: 05 October 2016 ID: G00293203Analyst(s): Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald
– Complimentary Gartner Research: 100 Data and Analytics Predictions Through 2020. Get exclusive access to Gartner’s top 100 data and analytics predictions through 2020. Plus access other relevant Gartner research including Magic Quadrant reports for database and data warehouse solutions, and the market guide for in-memory computing (LINK to MemSQL web site – registration required).
– MarkLogic Recognized in New Gartner® Magic Quadrant. Gartner Magic Quadrant for Operational Database Management Systems positions MarkLogic® the highest for ability to execute in the Challengers Quadrant
Follow us on Twitter: @odbmsorg
“From my perspective, data quality is paramount to an evolving market. When the quality of data improves in a market, both insurance carriers and consumers can make better decisions. “–Charles Kaminski Jr.
I have interviewed Charles Kaminski Jr., Sr. Architect at LexisNexis Risk Solutions. Main topics of the interview are the technological challenges the insurance industry is currently facing, fraud detection, and how to effectively use predictive analytics.
Q1. What is your role at LexisNexis Risk Solutions?
Charles Kaminski Jr.: I am a Sr. Architect at LexisNexis Risk Solutions. I’ve worked for LexisNexis Risk Solutions for about 7 years. My primary responsibility is international expansion for the Insurance vertical. I also work on enterprise initiatives, new technologies, new product development, patents & intellectual property, and acquisitions. From time to time I work with RELX sister companies when they need help. The RELX Group is our parent company.
Q2. How is the life insurance industry evolving?
Charles Kaminski Jr.: My view is somewhat specific to the international markets I serve. From my perspective, data quality is paramount to an evolving market. When the quality of data improves in a market, both insurance carriers and consumers can make better decisions. As that happens, the vast majority of consumers and other players in that market benefit. This isn’t limited to the life insurance industry, but I see it happening there as well.
Q3. What are in your opinion, the main technological challenges the insurance industry is currently facing?
Charles Kaminski Jr.: Each market around the global tends to have its own nuances that don’t apply to any other market. An entity in one market (such as a bank, an aggregator, or a software house) may play a different role or no role at all in another market. Regulations, government involvement, and industry support also vary greatly. I see this in auto, life, and health verticals. These factors create different challenges from one market to the next. But, there are a few themes that seem to exist regardless of market.
Insurance carriers around the globe tend to utilize a healthy mixture of old and new technologies. The technology leaders in this industry are generally more risk adverse when compared to other, less regulated, industries. Also, workflows on the carrier side can be very complex. The primary technological challenge to new product development is understanding customer and vendor technology roadmaps and the implied assumptions in those roadmaps. Understanding the entities in a market as well as their roadmaps is key to being successful.
Q4. Cross-industry fraud is defined by a fraud case where the perpetrator’s activity touches multiple industries and organizations, habitually exploiting system gaps. Is using data and analytics the solution to fraud detection?
Charles Kaminski Jr.: A product person might better answer if using data and analytics is “the” solution to fraud detection. I can tell you it is a very effective solution. Big data can cross boundaries and tell unique stories like no other tool. Companies that reign supreme in crossing those boundaries are the ones that have the technical capabilities to analyze big data with ease and the creative people to ask questions no one else is thinking to ask. One interesting story I can relay here is from work others at LexisNexis have done. It comes from someone I’ve shared a stage with a number of times, so I’m very familiar with the story.
LexisNexis Risk Solutions was asked to help a US state agency identify potential Medicaid fraud. Medicaid fraud is big business with lots of money changing hands. For any state agency with limited resources, it’s never a question of finding enough fraud to prosecute. It’s always a question of finding the big fish to fry.
This US state agency in question could only share the addresses of people using Medicaid and nothing more.
Just a list of addresses is not much to go on. But with the right tools, it’s a good start: Why is someone at one address registering a number of really expensive cars? Why is someone at another address registering a rather expensive boat?
Why is someone at yet another address who owns a Medicaid processing business and buying multiple multi-million-dollar condos, why are they possibly on Medicaid?
Some of these will no doubt be coincidence and I’m oversimplifying this by not mentioning some additional and rather complex analysis. I’m sure you get the idea though. Ultimately you have an interesting list of addresses scored and ordered in terms of where you might want to take a closer look. But that’s not where this story ends. That scored and ordered list is just where this story starts to get interesting.
With a big-data system geared towards analytics, we can take that list and overlay relationship data on top of it.
You can build relationship data from all kinds of sources — who’s married or ever been married to whom, previous neighbors who lived near each other, jointly-registered assets, various public records from business dealings, etc.
When we overlay who knows who, multiple circles start to form. People who don’t know each other are in these circles and at the center of many of these circles (connecting them together) are people who weren’t in the original address list.
Those folks in the center of those circles are the big fish to take a closer look at. Many of these people in the center are the generals recruiting lieutenants to commit the fraud for them. These generals do this so they can stay below the radar.
That’s the interesting part of this story. It’s a story of how big data and analytics can take you from just a list of addresses to some big fish in the center of a fraud ring.
Q5. Drew Whitmore, Senior Director, Insurance Global Alliances, LexisNexis® Risk Solutions, said: “Insurance carriers need innovative core policy and claims management solutions integrated with industry-leading data and analytics to meet their business objectives and deliver on promises of exceptional customer experience,” Why do you believe that a single point of entry to these data and analytic solutions is the best option for insurers’ technology resources and workflow processes?
Charles Kaminski Jr.: Insurance workflows can be very complex. Products that support these workflows can have complex interface. To a technologist, success with a single-point-of-entry strategy is very clear.
Success is when we release a new product but 90% of the single-point-of-entry-interface doesn’t change.
Further still, success is when the technologist on the other side, the employee of the customer, knows exactly what is going on with the new product. Success is when a technologist on the other end of the interface says, “I get what LexisNexis is doing with this.” That technologist also benefits when he or she needs to discuss the new product with a legal departments or internal auditors because those groups will already be familiar with the interface.
Q6. What is the LexisNexis Risk Solutions telematics data and analytics platform? And how is it used in the Insurance industry?
Charles Kaminski Jr.: The telematics platform is a horizontally scalable, high performance, big data and analytics platform. It and the associated data is used by carriers who want to understand driving behavior as well as a number of other attributes associated with a policy. Because the platform is format agnostic, carriers have quite a bit of flexibility to use our solutions or bring their own to the table.
I was part of the original team bringing telematics solutions to market. We considered a number of different problems to solve, prototypes, and solutions in those early days. We went through a number of iterations before settling on our first telematics solution. That initial product enabled telematics for carriers by using a consumer’s smart phone, an ODB2 dongle, and LN’s scalable data analytics systems to store and analyze the data. A dedicated telematics team continues to expand our telematics offerings. I’m no longer involved day-to-day.
Q7. According to a Gartner report* referencing its 2015 CIO Study, “eighty-seven percent of CIOs agree that there is a shift to predictive analytics from reporting in their organizations, and 79% believe that the greatest value and insight will come from active experimentation informed by data rather than the passive analysis of data.” What is your take on this?
Charles Kaminski Jr.: Big data and predictive analytics are powerful tools that have transformed a number of industries. For insurance, they are a must. But these tools are now being adopted by a number of other industries and they are sometimes misapplied. There are a number of cautionary case studies in business news where these capabilities were brought into an organization with high cost and high expectations but the investment provided negative returns. Wikibon is reporting that most enterprises expect a return of $3.50 per dollar spent on big data systems but that the actual return to date is more like $0.55 per dollar spent.
My take on this is twofold. First, if you are looking to bring big data and predictive analytics in house, then spend some time choosing the right first business case with a low cost and a low bar to success. This gives you greater flexibility to find scarce resources around big data and predictive modeling, prove out your technology, and fine tune your assumptions. Also, be sure the resources you engage with have experience getting positive returns using big data and analytics.
Second, if you are an executive looking to drive improvements with these tools and you do not currently have a predictive analytics engine, then consider broader trends first. Twenty years ago businesses goals were being managed through results.
Since then there has been a shift towards driving business and organizational improvements using lead measures and lead indicators. This doesn’t necessarily mean predictive analytics. These lead measures and lead indicators can be developed and iterated over quickly without big-data and complex analytics. They can then be used to drive improvements across an enterprise. This can be done before tools such as big data and predictive modeling are introduced.
There are people and firm that can help businesses get started immediately with comparatively low costs.
Charles Kaminski is a Sr. Architect for LexisNexis Risk Solutions. Charles was part of the team that open-sourced the LexisNexis big data platform, HPCC Systems, which is the backbone of LexisNexis Risk Solutions. He now focuses on global markets and international expansion for the company’s Insurance business. Charles has worked for NASA in their Solar System Exploration Division, Accenture’s Financial Services vertical, and was an entrepreneur before joining LexisNexis Risk Solutions. Charles lives outside of Atlanta with his wife and children.
– Introduction to HPCC (High-Performance Computing Cluster). Authors: Anthony M. Middleton, Ph.D. LexisNexis Risk Solutions and Arjuna Chala, Sr. Director Operations, LexisNexis Risk Solutions.ODBMS.org, FEBRUARY 19, 2016
Follow us on Twitter: @odbmsorg
“The best way to explore all The Panama Papers data was using graph database technology, because it’s all relationships, people connected to each other or people connected to companies.” –Mar Cabra.
I have interviewed Mar Cabra, head of the Data & Research Unit of the International Consortium of Investigative Journalists (ICIJ). Main subject of the interview is how the 11.5 million Panama Papers were analysed.
Q1. What is the mission of the International Consortium of Investigative Journalists (ICIJ)?
Mar Cabra: Founded in 1997, the ICIJ is a global network of more than 190 independent journalists in more than 65 countries who collaborate on breaking big investigative stories of global social interest.
Q2. What is your role at ICIJ?
Mar Cabra: I am the Editor at the Data and Research Unit – the desk at the ICIJ that deals with data, analysis and processing, as well as supporting the technology we use for our projects.
Q3. The Panama Papers investigation was based on a 2.6 Terabyte trove of data obtained by Süddeutsche Zeitung and shared with ICIJ and a network of more than 100 media organisations. What was your role in this data investigation?
Mar Cabra: I co-ordinated the work of the team of developers and journalists that first got the leak from Süddeutsche Zeitung, then processed it to make it available online though secure platforms with more than 370 journalists.
I also supervised the data analysis that my team did to enhance and focus the stories. My team was also in charge of the interactive product that we produced for the publication stage of The Panama Papers, so we built an interactive visual application called the ‘Powerplayers’ where we detailed the main stories of the politicians with connections to the offshore world. We also released a game explaining how the offshore world works! Finally, in early May, we updated the offshore database with information about the Panama Papers companies, the 200,000-plus companies connected with Mossack Fonseca.
Q4. The leaked dataset are 11.5 million files from Panamanian law firm Mossack Fonseca. How was all this data analyzed?
Mar Cabra: We relied on Open Source technology and processes that we had worked on in previous projects to process the data. We used Apache Tika to process the documents and also to access them, and created a processing chain of 30 to 40 machines in Amazon Web Services which would process in parallel those documents, then index them onto a document search platform that could be used by 100s of journalists from anywhere in the world.
Q5. Why did you decide to use a graph-based approach for that?
Mar Cabra: Inside the 11.5 million files in the original dataset given to us, there were more than 3 million that came from Mossaka Fonseca’s internal database, which basically contained names of companies in offshore jurisdictions and the people behind them. In other words, that’s a graph! The best way to explore all The Panama Papers data was using graph database technology, because it’s all relationships, people connected to each other or people connected to companies.
Q6. What were the main technical challenges you encountered in analysing such a large dataset?
Mar Cabra: We had already used all the tools that we were using in this investigation, in previous projects. The main issue here was dealing with many more files in many more formats. So the main challenge was how can we make readable all those files, which in many cases were images, in a fast way.
Our next problem was how could we make them understandable to journalists that are not tech savvy. Again, that’s where a graph database became very handy, because you don’t need to be a data scientist to work with a graph representation of a dataset, you just see dots on a screen, nodes, and then just click on them and find the connections – like that, very easily, and without having to hand-code or build queries. I should say you can build queries if you want using Cypher, but you don’t have to.
Q7. What are the similarities with the way you analysed data in the Swiss Leaks story (exposing the fraudulent activity of 100,000 HSBC private bank clients in Switzerland)?
Mar Cabra: We used the same tools for that – a document search platform and a graph database and we used them in combination to find stories. The baseline was the same but the complexity was 100 times more for the Panama Papers. So the technology is the same in principle, but because we were dealing with many more documents, much more complex data, in many more formats, we had to make a lot of improvements in the tools so they really worked for this project. For example, we had to improve the document search platform with a batch search feature, where journalists would upload a list of names and then they would get a list back of links when that list of names had a hit a document.
Q8. Emil Eifrem, CEO, Neo Technology wrote: “If the Panama Papers leak had happened ten years ago, no story would have been written because no one else would have had the technology and skillset to make sense of such a massive dataset at this scale.” What is your take on this?
Mar Cabra: We would have done the Panama Papers papers differently, probably printing the documents – and that would have had a tremendous effect on the paper supplies of the world, because printing out all 11.5 million files would have been crazy! We would have published some stories and the public might have seen some names on the front page of a few newspapers, but the scale and the depth and the understanding of this complex world would not have been able to happen without access to the technology we have today. We would just have not been able to do such an in-depth investigation at a global scale without the technology we have access to now.
Q9. Whistleblowers take incredible risks to help you tell data stories. Why do they do it?
Mar Cabra: Occasionally, some whistleblowers have a grudge and are motivated in more personal terms. Many have been what we call in Spanish ‘widows of power’: people who have been in power and have lost it, and those who wish to expose the competition or have a grudge. Motivations of Whistleblowers vary, but I think there is always an intention to expose injustice. ‘John Doe’ is the source behind the Panama Papers, and a few weeks after we published, he explained his motivation; he wanted to expose an unjust system.
Mar Cabra is the head of ICIJ’s Data & Research Unit, which produces the organization’s key data work and also develops tools for better collaborative investigative journalism. She has been an ICIJ staff member since 2011, and is also a member of the network.
Mar fell in love with data while being a Fulbright scholar and fellow at the Stabile Center for Investigative Journalism at Columbia University in 2009/2010. Since then, she’s promoted data journalism in her native Spain, co-creating the first ever masters degree on investigative reporting, data journalism and visualisation and the national data journalism conference, which gathers more than 500 people every year.
She previously worked in television (BBC, CCN+ and laSexta Noticias) and her work has been featured in the International Herald Tribune, The Huffington Post, PBS, El País, El Mundo or El Confidencial, among others.
In 2012 she received the Spanish Larra Award to the country’s most promising journalist under 30. (PGP public key)
– Offshore Leaks Database: Released in June 2013, the Offshore Leaks Database is a simple search box.
Open Source used for analysing the #PanamaPapers:
– We created a small program ourselves which we called Extract which is actually in our GitHub account that allowed us to do this parallel processing. Extract would get a file and try to see if it could recognize the content. If it couldn’t recognize the content, then we would do OCR and then send it to our document searching platform, which was Apache Solr.
– Based on Apache Solr, we created an index, and then we used Project Blacklight, another open source tool that was originally used for libraries, as our front-end tool. For example, Columbia University Library, where I studied, used this tool.
– Linkurious: Linkurious is software that allows you to visualize graphs very easily. You get a license, you put it in your server, and if you have a database in Neo4j you just plug it in and within hours you have the system set up. It also has this private system where our reporters can login or logout.
– Thanks to another open source tool – in this case Talend – and extractions from a load tool, we were able to easily transform our database into Neo4j, plug in Linkurious and get reporters to search.
–Neo4j: Neo4j is a highly scalable, native graph database purpose-built to leverage not only data but also its relationships. Neo4j’s native graph storage and processing engine deliver constant, real-time performance, helping enterprises build intelligent applications to meet today’s evolving data challenges.
-The good thing about Linkurious is that the reporters or the developers at the other end of the spectrum can also make highly technical Cypher queries if they want to start looking more in depth at the data.
“Data integration isn’t just about moving data from one place to another. It’s about building an actionable, operational view on data that comes from multiple sources so you can integrate the combined data into your operations rather than just looking at it later as you would in a typical warehouse project.” — David Gorbet.
I have interviewed David Gorbet, Senior Vice President,Engineering at MarkLogic. We cover several topics in the interview: Silos, Data integration, data quality, security and the new features of MarkLogic 9.
Q1. Data integration is the number one challenge for many organisations. Why?
David Gorbet: There are three ways to look at that question. First, why do organizations have so many data silos? Second, what’s the motivation to integrate these silos, and third, why is this so hard?
Our Product EVP, Joe Pasqua, did an excellent presentation on the first question at this year’s MarkLogic World. The spoiler is that silos are a natural and inevitable result of an organization’s success. As companies become more successful, they start to grow. As they grow, they need to partition in order to scale. To function, these partitions need to run somewhat autonomously, which inevitably creates silos.
Another way silos enter the picture is what I call “application accretion” or less charitably, “crusty application buildup.” Companies merge, and now they have two HR systems. Divisions acquire special-purpose applications and now they have data that exists only in those applications. IT projects are successful and now need to add capabilities, but it’s easier to bolt them on and move data back and forth than to design them into an existing IT system.
Two years ago I proposed a data-centric view of the world versus an application-centric view. If you think about it, most organizations have a relatively small number of “things” that they care deeply about, but a very large number of “activities” they do with these “things.”
For example, most organizations have customers, but customer-related activities happen all across the organization.
Sales is selling to them. Marketing is messaging to them. Support is helping solve their problems. Finance is billing them. And so on… All these activities are designed to be independent because they take place in organizational silos, and the data silos just reflect that. But the data is all about customers, and each of these activities would benefit greatly from information generated by and maintained in the other silos. Imagine if Marketing could know what customers use the product for to tailor the message, or if Sales knew that the customer was having an issue with the product and was engaged with Support? Sometimes dealing with large organizations feels like dealing with a crazy person with multiple personalities. Organizations that can integrate this data can give their customers a much better, saner experience.
And it’s not just customers. Maybe it’s trades for a financial institution, or chemical compounds for a pharmaceutical company, or adverse events for a life sciences company, or “entities of interest” for an intelligence or police organization. Getting a true, 360-degree view of these things can make a huge difference for these organizations.
In some cases, like with one customer I spoke about in my most recent MarkLogic World keynote who looks at the environment of potentially at-risk children, it can literally mean the difference between life and death.
So why is this so hard? Because most technologies require you to create data models that can accommodate everything you need to know about all of your data in advance, before you can even start the data integration project. They also require you to know the types of queries you’re going to do on that data so you can design efficient schemas and indexing schemes.
This is true even of some NoSQL technologies that require you to figure out sharding and compound indexing schemes in advance of loading your data. As I demonstrated in that keynote I mentioned, even if you have a relatively small set of entities that are quite simple, this is incredibly hard to do.
Usually it’s so hard that instead organizations decide to do a subset of the integration to solve a specific need or answer a specific question. Sadly, this tends to create yet another silo.
Q2. Integrate data from silos: how is it possible?
David Gorbet: Data integration isn’t just about moving data from one place to another. It’s about building an actionable, operational view on data that comes from multiple sources so you can integrate the combined data into your operations rather than just looking at it later as you would in a typical warehouse project.
How do you do that? You build an operational data hub that can consume data from multiple sources and expose APIs on that data so that downstream consumers, either applications or other systems, can consume it in real time. To do this you need an infrastructure that can accommodate the variability across silos naturally, without a lot of up-front data modeling, and without each silo having a ripple effect on all the others.
For the engineers out there (like me), think of this as trying to turn an O(n2) problem into an O(n) problem.
As the number of silos increases, most projects get exponentially more complex, since you can only have one schema and every new silo impacts that schema, which is shared by all data across all existing silos. You want a technology where adding a new data silo does not require re-doing all the work you’ve already done. In addition, you need a flexible technology that allows a flexible data model that can adapt to change. Change in both what data is used and in how it’s used. A system that can evolve with the evolving needs of the business.
MarkLogic can do this because it can ingest data with multiple different schemas and index and query it together.
You don’t have to create one schema that can accommodate all your data. Our built-in application services allows our customers to build APIs that expose the data directly from their data hub and with ACID transactions, these APIs can be used to build real operational applications.
Q3. What is the problem with traditional solutions like relational databases, Extract Transform and Load (ETL) tools?
David Gorbet: To use a metaphor, most technology used for this type of project is like concrete. Now concrete is incredibly versatile. You can make anything you want out of concrete: a bench, a statue, a building, a bridge… But once you’ve made it, you’d better like it because if you want to change it you have to get out the jackhammer.
Many projects that use these tools start out with lofty goals, and they spend a lot of time upfront modeling data and designing schemas. Very quickly they realize that they are not going to be able to make that magical data model that can accommodate everything and be efficiently queried. They start to cut corners to make their problem more tractable, or they design flexible but overly generic models like tall thin tables that are inefficient to query. Every corner they cut limits the types of applications they can then build on the resulting integrated data, and inevitably they end up needing some data they left behind, or needing to execute a query they hadn’t planned (and built an index) for.
Usually at some point they decide to change the model from a hub-and-spoke data integration model to a point-to-point model, because point-to-point integrations are much easier. That, or it evolves as new requirements emerge, and it becomes impossible to keep up by jackhammering the system and starting over. But this just pushes the complexity out of these now point-to-point flows and into the overall system architecture. It also causes huge governance problems, since data now flows in lots of directions and is transformed in many ways that are generally pretty opaque and hard to trace. The inability to capture and query metadata about these data flows causes master-data problems and governance problems, to the point where some organizations genuinely have no idea where potentially sensitive data is being used. The overall system complexity also makes it hard to scale and expensive to operate.
Q4. What are the typical challenges of handling both structured, and unstructured data?
David Gorbet: It’s hard enough to integrate structured data from multiple silos. Everything I’ve already talked about applies even if you have purely structured data. But when some of your data is unstructured, or has a complex, variable structure, it’s much harder. A lot of data has a mix of structured data and unstructured text. Medical records, journal articles, contracts, emails, tweets, specifications, product catalogs, etc. The traditional solution to textual data in a relational world is to put it in an opaque BLOB or CLOB, and then surface its content via a search technology that can crawl the data and build indexes on it. This approach suffers from several problems.
First, it involves stitching together multiple different technologies, each of which has its own operational and governance characteristics. They don’t scale the same way. They don’t have the same security model (unless they have no security model, which is actually pretty common). They don’t have the same availability characteristics or disaster recovery model.
They don’t backup consistently with each other. The indexes are separate, so they can’t be queried together, and keeping them in sync so that they’re consistent is difficult or impossible.
Second, more and more text is being mined for structure. There are technologies that can identify people, places, things, events, etc. in freeform text and structure it. Sentiment analysis is being done to add metadata to text. So it’s no longer accurate to think of text as islands of unstructured data inside a structured record. It’s more like text and structure are inter-mixed at all levels of granularity. The resulting structure is by its nature fluid, and therefore incompatible with the up-front modeling required by relational technology.
Third, search engines don’t index structure unless you tell them to, which essentially involves explaining the “schema” of the text to them so that they can build facets and provide structured search capabilities. So even in your “unstructured” technology, you’re often dealing with schema design.
Finally, as powerful as it is, search technology doesn’t know anything about the semantics of the data. Semantic search enables a much richer search and discovery experience. Look for example at the info box to the right of your Google results. This is provided by Google’s knowledge graph, a graph of data using Semantic Web technologies. If you want to provide this kind of experience, where the system can understand concepts and expand or narrow the context of the search accordingly, you need yet another technology to manage the knowledge graph.
Two years ago at my MarkLogic World keynote I said that search is the query language for unstructured data, so if you have a mix of structured and unstructured data, you need to be able to search and query together. MarkLogic lets you mix structured and unstructured search, as well as semantic search, all in one query, resolved in one technology.
Q5. An important aspect when analysing data is Data Quality. How do you evaluate if the data is of good or of bad quality?
David Gorbet: Data quality is tough, particularly when you’re bringing data together from multiple silos. Traditional technologies require you to transform the data from one schema into another in order to move it from place to place. Every transformation leaves some data behind, and every one has the potential to be a point of data loss or data corruption if the transformation isn’t perfect. In addition, the lineage of the data is often lost. Where did this attribute of this entity come from? When was it extracted? What was the transform that was run on it? What did it look like before?
All of this is lost in the ETL process. The best way to ensure data quality is to always bring along with each record the original, untransformed data, as well as metadata tracing its provenance, lineage and context.
MarkLogic lets you do this, because our flexible schema accommodates source data, canonicalized (transformed) data, and metadata all in the same record, and all of it is queryable together. So if you find a bug in your transform, it’s easy to query for all impacted records, and because you have the source data there, you can easily fix it as well.
In addition, our Bitemporal feature can trace changes to a record over time, and let you query your data as it is, as it was, or as you thought it was at any given point in time or over any historical (or in some cases future) time range. So you have traceability when your data changes, and you can understand how and why it has changed.
Q6. Data leakage is another problem for many corporations that experienced high profile security incidents. What can be done to solve this problem?
David Gorbet: Security is another important aspect of data governance. And security isn’t just about locking all your data in a vault and only letting some people look at it. Security is more granular than that. There are some data that can be seen by just about anyone in your organization. Some that should only be seen by people who need it, and some that should be hidden from all but people with specific roles. In some cases, even users with a particular role should not see data unless they have a provable need in addition to the role required. This is called “compartment security,” meaning you have to be in a certain compartment to see data, regardless of your role or clearance overall.
There is a principle in security called “defense in depth.” Basically it means pushing the security to the lowest layer possible in the stack. That’s why it’s critically important that your DBMS have strong and granular security features.
This is especially true if you’re integrating data from silos, each of which may have its own security rules.
You need your integrated data hub to be able to observe and enforce those rules, regardless of how complex they are.
Increasingly the concern is over the so-called “insider threat.” This is the employee, contractor, vendor, managed service provider, or cloud provider who has access to your infrastructure. Another good reason not to implement security in your application, because if you do, any DBA will be able to circumvent it. Today, with the move to cloud and other outsourced infrastructure, organizations are also concerned about what’s on the file system. Even if you secure your data at the DBMS layer, a system administrator with file system access can still get at it. To counter this, more organizations are requiring “at rest” encryption of data, which means that the data is encrypted on the file system. A good implementation will require a separate role to manage encryption keys, different from the DBA or SA roles, along with a separate key management technology. In our implementation, MarkLogic never even sees the database encryption keys, relying instead on a separate key management system (KMS) to unlock data for us. This separation of concerns is a lot more secure, because it would require insiders to collude across functions and organizations to steal data. You can even keep your data in the cloud and your keys on-premises, or with another managed service provider.
Q8. What is new in MarkLogic® 9 database? ?
David Gorbet: There’s so much in MarkLogic 9 it’s hard to cover all of it. That presentation I referenced earlier from Joe does a pretty good job of summarizing the features. Many of the features in MarkLogic 9 are designed to make data integration even easier. MarkLogic 9 has new ways of modeling data that can keep it in its flexible document form, but project it into tabular form for more traditional analysis (aggregates, group-bys, joins, etc.) using either SQL or a NoSQL API we call the Optic API. This allows you to define the structured parts of your data and let MarkLogic index it in a way that makes it most efficient to query and aggregate.
You can also use this technique to extract RDF triples from your data, giving you easy access to the full power of Semantics technologies.
We’re doing more to make it easier to get data into MarkLogic via a new data movement SDK that you can hook directly up to your data pipeline. This SDK can help orchestrate transformations and parallel loads of data no matter where it comes from.
We’re also doubling down on security. Earlier I mentioned encryption at rest. That’s a new feature for MarkLogic 9.
We’re also doing sub-record-level role- and compartment-based access control. This means that if you have a record (like a customer record) that you want to make broadly available, but there is some data in that record (like a SSN) that you want to restrict access to, you can easily do that. You can also obfuscate and transform data within a record to redact it for export or for use in a context that is less secure than MarkLogic.
Security is a governance feature, and we’re improving other governance features as well, with policy-based tiering for lifecycle management, and improvements to our Bitemporal feature that make it a full-fledged compliance feature.
We’re introducing new tools to help monitor and manage multiple clusters at a time. And we’re making many other improvements in many other areas, like our new geospatial region index that makes region-region queries much faster, improvements to tools like Query Console and MLCP, and many, many more.
One exciting feature that is a bit hard to understand at first is our new Entity Services feature. You can think of this as a catalog of entities. You can put whatever you want in this catalog. Entity attributes, relationships, etc. but also policies, governance rules, and other entity class metadata. This is a queryable semantic model, so you can query your catalog at runtime in your application. We’ll also be providing tools that use this catalog to help build the right set of indexes, indexing templates, APIs, etc. for your specific data. Over time, Entity Services will become the foundation of our vision of the “smart database.” You’ll hear us start talking a lot more about that soon.
David Gorbet, Senior Vice President, Engineering, MarkLogic.
David Gorbet has the best job in the world. As SVP of Engineering, David manages the team that delivers the MarkLogic product and supports our customers as they use it to power their amazing applications. Working with all those smart, talented engineers as they pour their passion into our product is a humbling experience, and seeing the creativity and vision of our customers and how they’re using our product to change their industry is simply awesome.
Prior to MarkLogic, David helped pioneer Microsoft’s business online services strategy by founding and leading the SharePoint Online team. In addition to SharePoint Online, David has held a number of positions at Microsoft and elsewhere with a number of enterprise server products and applications, and numerous incubation products.
David holds a Bachelor of Applied Science Degree in Systems Design Engineering with an additional major in Psychology from the University of Waterloo, and an MBA from the University of Washington Foster School of Business.
–Join the Early Access program for a MarkLogic 9 introduction by visiting: ea.marklogic.com
-The MarkLogic Developer License is free to all who sign up and join the MarkLogic developer community.
– On Data Governance. Interview with David Saul. ODBMS Industry Watch, 2016-07-23
– On Data Interoperability. Interview with Julie Lockner. ODBMS Industry Watch, 2016-06-07
– On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, 2016-05-24
Follow us on Twitter: @odbmsorg
“Any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.”–Dave Thomas.
I have interviewed Dave Thomas,Chief Scientist at Kx Labs.
Q1. For many years business users have had their data locked up in databases and data warehouses. What is wrong with that?
Dave Thomas: It isn’t so much an issue of where the data resides, whether it is in files, databases, data warehouses or a modern data lake. The challenge is that modern businesses need access to the raw data, as well as the ability to rapidly aggregate and analyze their data.
Q2. Typical business intelligence (BI) tool users have never seen their actual data. Why?
Dave Thomas: For large corporations hardware and software both used to be prohibitively expensive, hence much of their data was aggregated prior to making it available to users. Even today when machines are very inexpensive most corporate IT infrastructures are impoverished relative to what one can buy on the street or in the Cloud.
Compounding the problem, IT charge-back mechanisms are biased to reduce IT spending rather than to maximize the value of data delivered to the business.
Traditional technologies are not sufficiently performant to allow processing of large volumes of data.
Many companies have inexpensive data lakes and have realized after the fact that using a commodity storage systems, such as HDFS, has severely constrained their performance and limited their utility. Hence more corporations are moving data away from HDFS into high-performance storage or memory.
Q3. What are the limitations of the existing BI and extract, transform and load (ETL) data tools?
Dave Thomas: Traditional BI tools assume that it is possible for DBAs and BI experts to a priori define the best way to structure and query the data. This reduces the whole power of BI to mere reporting. In an attempt to deal with huge BI backlogs, generic query and reporting tools have become popular to shift reporting to self-serve. However, they are often designed for sophisticated BI users rather than for normal business users. They are often not performant because they depend on the implementation of the underlying data stores.
For the most part, existing ETL tools are constrained by having to move the data to the ETL process and then on to the end user. Many ETL tools only work against one kind of data source. ETL can’t be written by normal users and due to the cost of an incorrect ETL run, such tools are not available to the data analyst. One of the major topics of discussion in Big Data shops is the complexity and performance of their Big Data pipeline. ETL, data blending, shouldn’t be a separate process or product. It should be something one can do with queries in a single efficient data language.
Q4. What are the typical technical challenges in finance, IoT and other time-series applications?
1. Speed, as data volumes and variety are always increasing.
2. Ability to deal with both real-time events and historical events efficiently. Ideally in a single technology.
3. To handle time-series one needs to be able to deal with simultaneous arrival of events. Time with nanosecond precision is our solution. Other solutions are constrained by using milliseconds and event counters that are much less efficient.
4. High-performance operations on time, over days, months and years are essential for time-series. This is why time is a native type in Kx.
5. The essence of time-series is processing sliding time windows of data for both joins and aggregations.
6. In IOT, data is always dirty. Kx’s native support for missing data and out of band data due to failing sensors, allows one to deal with the realities of sensor data.
Q5. Kx offers analysts a language called q. Why not extend standard SQL?
Dave Thomas: I think there is a misunderstanding about q. Q is a full functional data language that both includes and extends SQL. Selects are easier than SQL because they provide implicit joins and group-bys. This makes queries roughly 50% of the code of SQL. Unlike many flavors of SQL, q lets one put a functional expression in any position in an SQL statement. One can easily extend the aggregation operations available to the end-user.
Q6. Can you show the difference between a query written in q and in standard SQL?
Dave Thomas: Here’s an example of retrieving parts from an orders table with a foreign key join to a parts table, summing by quantity and then sorting by color:
select sum qty by p.color from sp
select p.color, sum(sp.qty) from sp, p
where sp.p=p.p group by p.color order by color
Q7. How do queries execute inside the database?
Dave Thomas: Q is native to the database engine. Hence queries and analytics execute in the columns of the Kx database. There is no data shipping between the client and database server.
Q8. Shawn Rogers of Dell said: “A ‘citizen data scientist’ is an everyday, non-technical user that lacks the statistical and analytical prowess of a traditional data scientist, but is equally eager to leverage data in order to uncover insights, and importantly, do so at the speed of business.” What is your take on this?
Dave Thomas: High-performance data technologies, such as Kx, using modern large-memory hardware, can support data analysts versus data scientist queries. In the product Analyst for Kx, for example, users can work interactively on a sample of data using visual tools to import, clean, query, transform, analyze and visualize data with minimal, if any programming or even SQL. Given correct operations on one or more samples they then can be run against trillions of rows of data. Data analysts today can truly live in their data.
Q9. What are the risks of bringing the power of analytics to users who are non-expert programmers?
Dave Thomas: Clearly any important analysis needs to be validated and cross-checked. Hence any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.
In our experience users do make initial mistakes, but as they live in their data they quickly learn.
Visualization really helps, as does the provision of metadata about the data sources. Reducing the cycle time provides increased understanding, and allows one to make mistakes.
Runaway query performance has been a concern of DBAs, but for many years frameworks have been in place such as our smart query router that will ensure that ad hoc queries against massive datasets are throttled so they don’t run away. Fortunately, recent cost reductions in non-volatile memory make it possible to have high-performance query-only replicas of data that can be made available to different parts of the organization based on its needs.
Q10. How can non-expert programmers understand if the information expressed in visual analytics such as heat maps or in operational dashboard charts, is of good quality or not?
Dave Thomas: In our experience users spot visual anomalies much faster than inconsistencies in a spreadsheet.
Q11. What are the opportunities arising in “democratizing” the use of massive data sets?
Dave Thomas: We are finally living in a world where for many companies it is possible to run a real-time business where everyone can have fast, efficient access to the data they need. Rather than being held hostage to aggregations, spreadsheets and all sorts of variants of the truth, the organization can expediently see new opportunities to improve results in sales, marketing, production and other business operations.
Q12. How important is data query and data semantics?
Dave Thomas: Unfortunately we are not educated on how to express data semantics and data query.
Even computer scientists often study less about writing queries than how to execute them efficiently.
We need to educate students and employees on how to live in their data. It may well be that the future of programming for most will be writing queries. Given powerful data languages even compiler optimizations can be expressed by queries.
We need to invest much more in data governance and the use of standard terminology in order to share data within and across companies.
Dave Thomas, Kx Labs.
As Chief Scientist Dave envisions the future roadmap for Kx tools. Dave has had a long and storied career in computer software development and is perhaps best known as the founder and past CEO of Object Technology International, formerly OTI, now IBM OTI Labs, a pioneer in Agile Product Development. He was the principal visionary and architect for IBM VisualAge Smalltalk and Java tools and virtual machines including the popular open-source, multi-language Eclipse.org IDE. As the cofounder of Bedarra Research Labs he led the creation of the Ivy visual analytics workbench. Dave is a renowned speaker, university lecturer and Chairman of the Australian developer YOW! conferences.
– New Kx release includes encryption, enhanced compression and Tableau integration. ODBMS.org JULY 4, 2016.
–Kdb+ and the Internet of Things/Big Data. InDetail Paper by Bloor Research Author: Philip Howard. ODBMS.org- JANUARY 28, 2015
– Democratizing fast access to Big Data. By Dave Thomas, chief scientist at Kx Labs. ODBMS.org-April 26, 2016
–On Data Governance. Interview with David Saul. ODBMS Industry Watch, Published on 2016-07-23
–On the Challenges and Opportunities of IoT. Interview with Steve Graves. ODBMS Industry Watch, Published on 2016-07-06
–On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, Published on 2016-05-24
Follow us on Twitter: @odbmsorg
“Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size?.–Jim Starkey.
I have interviewed Jim Starkey. A database legend, Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history.
Q1. In your opinion, what are the most significant advances in databases in the last few years?
Jim Starkey: I’d have to say the “atom programming model” where a database is layered on a substrate of peer-to-peer replicating distributed objects rather than disk files. The atom programming model enables scalability, redundancy, high availability, and distribution not available in traditional, disk-based database architectures.
Q2. What was your original motivation to invent the NuoDB Emergent Architecture?
Jim Starkey: It all grew out of a long Sunday morning shower. I knew that the performance limits of single-computer database systems were in sight, so distributing the load was the only possible solution, but existing distributed systems required that a new node copy a complete database or partition before it could do useful work. I started thinking of ways to attack this problem and came up with the idea of peer to peer replicating distributed objects that could be serialized for network delivery and persisted to disk. It was a pretty neat idea. I came out much later with the core architecture nearly complete and very wrinkled (we have an awesome domestic hot water system).
Q3. In your career as an entrepreneur and architect what was the most significant innovation you did?
Jim Starkey: Oh, clearly multi-generational concurrency control (MVCC). The problem I was trying to solve was allowing ad hoc access to a production database for a 4GL product I was working on at the time, but the ramifications go far beyond that. MVCC is the core technology that makes true distributed database systems possible. Transaction serialization is like Newtonian physics – all observers share a single universal reference frame. MVCC is like special relativity, where each observer views the universe from his or her reference frame. The views appear different but are, in fact, consistent.
Q4. Proprietary vs. open source software: what are the pros and cons?
Jim Starkey: It’s complicated. I’ve had feet in both camps for 15 years. But let’s draw a distinction between open source and open development. Open development – where anyone can contribute – is pretty good at delivering implementations of established technologies, but it’s very difficult to push the state of the art in that environment. Innovation, in my experience, requires focus, vision, and consistency that are hard to maintain in open development. If you have a controlled development environment, the question of open source versus propriety is tactics, not philosophy. Yes, there’s an argument that having the source available gives users guarantees they don’t get from proprietary software, but with something as complicated as a database, most users aren’t going to try to master the sources. But having source available lowers the perceived risk of new technologies, which is a big plus.
Q5. You led the Falcon project – a transactional storage engine for the MySQL server- through the acquisition of MySQL by Sun Microsystems. What impact did it have this project in the database space?
Jim Starkey: In all honesty, I’d have to say that Falcon’s most important contribution was its competition with InnoDB. In the end, that competition made InnoDB three times faster. Falcon, multi-version in memory using the disk for backfill, was interesting, but no matter how we cut it, it was limited by the performance of the machine it ran on. It was fast, but no single node database can be fast enough.
Q6. What are the most challenging issues in databases right now?
Jim Starkey: I think it’s time to step back and reexamine the assumptions that have accreted around database technology – data model, API, access language, data semantics, and implementation architectures. The “relational model”, for example, is based on what Codd called relations and we call tables, but otherwise have nothing to do with his mathematic model. That model, based on set theory, requires automatic duplicate elimination. To the best of my knowledge, nobody ever implemented Codd’s model, but we still have tables which bear a scary resemblance to decks of punch cards. Are they necessary? Or do they just get in the way?
Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size?. SQL has no provision for flexible text search, no provision for multi-column, multi-table search, and no mechanics in the APIs to handle the results if it could do them. And this is just one a dozen problems that SQL databases can’t handle. It was a really good technical fit for computers, memory, and disks of the 1980’s, but is it right answer now?
Q7. How do you see the database market evolving?
Jim Starkey: I’m afraid my crystal ball isn’t that good. Blobs, another of my creations, spread throughout the industry in two years. MVCC took 25 years to become ubiquitous. I have a good idea of where I think it should go, but little expectation of how or when it will.
Qx. Anything else you wish to add?
Jim Starkey: Let me say a few things about my current project, AmorphousDB, an implementation of the Amorphous Data Model (meaning, no data model at all). AmorphousDB is my modest effort to question everything database.
The best way to think about Amorphous is to envision a relational database and mentally erase the boxes around the tables so all records free float in the same space – including data and metadata. Then, if you’re uncomfortable, add back a “record type” attribute and associated syntactic sugar, so table-type semantics are available, but optional. Then abandon punch card data semantics and view all data as abstract and subject to search. Eliminate the fourteen different types of numbers and strings, leaving simply numbers and strings, but add useful types like URL’s, email addresses, and money. Index everything unless told not to. Finally, imagine an API that fits on a single sheet of paper (OK, 9 point font, both sides) and an implementation that can span hundreds of nodes. That’s AmorphousDB.
Jim Starkey invented the NuoDB Emergent Architecture, and developed the initial implementation of the product. He founded NuoDB [formerly NimbusDB] in 2008, and retired at the end of 2012, shortly before the NuoDB product launch.
Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history from the Datacomputer project on the fledgling ARPAnet to his most recent startup, NuoDB, Inc. Through the period, he has been
responsible for many database innovations from the date data type to the BLOB to multi-version concurrency control (MVCC). Starkey has extensive experience in proprietary and open source software.
Starkey joined Digital Equipment Corporation in 1975, where he created the Datatrieve family of products, the DEC Standard Relational Interface architecture, and the first of the Rdb products, Rdb/ELN. Starkey was also software architect for DEC’s database machine group.
Leaving DEC in 1984, Starkey founded Interbase Software to develop relational database software for the engineering workstation market. Interbase was a technical leader in the database industry producing the first commercial implementations of heterogeneous networking, blobs, triggers, two phase commit, database events, etc. Ashton-Tate acquired Interbase Software in 1991, and was, in turn, acquired by Borland International a few months later. The Interbase database engine was released open source by Borland in 2000 and became the basis for the Firebird open source database project.
In 2000, Starkey founded Netfrastructure, Inc., to build a unified platform for distributable, high quality Web applications. The Netfrastructure platform included a relational database engine, an integrated search engine, an integrated Java virtual machine, and a high performance page generator.
MySQL, AB, acquired Netfrastructure, Inc. in 2006 to be the kernel of a wholly owned transactional storage engine for the MySQL server, later known as Falcon. Starkey led the Falcon project through the acquisition of MySQL by Sun Microsystems.
Jim has a degree in Mathematics from the University of Wisconsin.
For amusement, Jim codes on weekends, while sailing, but not while flying his plane.
Follow us on Twitter: @odbmsorg
“Intelligent system designers do have ethical responsibilities.”
I have interviewed John Markoff, technology writer at The New York Times.
In 2013 he was awarded a Pulitzer Prize.
The interview is related to his recent book “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots, published in August of 2015 by HarperCollins Ecco.
Q1. Do you share the concerns of prominent technology leaders such as Tesla’s chief executive, Elon Musk, who suggested we might need to regulate the development of artificial intelligence?
John Markoff: I share their concerns, but not their assertions that we may be on the cusp of some kind of singularity or rapid advance to artificial general intelligence. I do think that machine autonomy raises specific ethical and safety concerns and regulation is an obvious response.
Q2. How difficult is it to reconcile the different interests of the people who are involved in a direct or indirect way in developing and deploying new technology?
John Markoff: This is why we have governments and governmental regulation. I think AI, in that respect is no different than any other technology. It should and can be regulated when human safety is at stake.
Q3. In your book Machines of Loving Grace you argued that “we must decide to design ourselves into our future, or risk being excluded from it altogether”. What do you mean by that?
John Markoff: You can use AI technologies either to automate or to augment humans. The problem is minimized when you take an approach that is based on human centric design principles.
Q4. How is it possible in practice? Isn’t the technology space dominated by giants such as IBM, Apple,Google who dictate the direction of new technology?
John Markoff: This is a very interesting time with “giant” technology companies realizing that there are consequences in the deployment of these technologies. Google, IBM and Microsoft have all recently made public commitments to the safe use of AI.
Q5. What are the most significant new developments in the humans-computers area, that are likely to have a significant influence in our daily life in the near future?
John Markoff: One of the best things about being a reporter is that you don’t have to predict the future. You only have to note what the various visionaries say, so you can call that to their attention when their predictions prove inaccurate. With that caveat, if I am forced to bet on any particular information technology it would be augmented reality. This is because I believe that multi-touch interfaces for mobile devices simply can’t be the last step in user interface.
Q6. Do you believe that robots will really transform modern life?
John Markoff: I struggle with the definition of what is a “robot.” If something is tele-operated, for example, is it a robot? That said I think that we will increasingly be surrounded by machines that perform tasks.
The question is will they come as quickly as Silicon Valley seems to believe. My friend Paul Saffo has said, “Never mistake a clear view for a short distance.” And I think that is the case with all kinds of mobile robots, including self driving cars.
Q7. For the designers of Intelligent Systems, how difficult is to draw a line between what is human and what is machine?
John Markoff: I feel strongly that the possibility of designing cyborgs, particularly with respect to intellectual prosthesis is a boundary we should cross with great caution. Remember the Borg from StarTrek. “Resistance is futile, you will be assimilated.” I think the challenge is to use these systems to enhance human thought, not for social control.
Q8. What are the ethical responsibilities of designers of intelligent systems?
John Markoff: I think the most important aspect of that question is the simple acknowledgement that intelligent system designers do have ethical responsibilities. That has not always been the case, but it seems to be a growing force within the community of AI and robotics designers in the past five years, so I’m not entirely pessimistic.
Q9. If humans delegate decisions to machines, who will be responsible for the consequences?
John Markoff: Ben Shneiderman, the University of Maryland computer scientist and user interface designer has written eloquently on this point. Indeed he argues against autonomous systems for precisely this reason. His point is that it is essential to keep a human in the loop. If not you run the risk of abdicating ethical responsibility for system design.
Q10. Assuming there is a real potential in using data–driven methods to both help charities develop better services and products, and understand civil society activity. In your opinion, what are the key lessons and recommendations for future work in this space?
John Markoff: I’m afraid I’m not an expert in the IT needs of either charities or NGOs. That said a wide range of AI advances are already being delivered at nominal cost via smart phones. As cheap sensors proliferate virtually all everyday objects will gain intelligence that will be widely accessible.
Qx. Anything else you wish to add?
John Markoff: Only that I think it is interesting that the augmentation vs automation dichotomy is increasingly seen as a path through which to navigate the impact of these technologies. Computer system designers are the ones who will decide what the impact of these technologies are and whether to replace or augment humans in society.
JOHN GREGORY MARKOFF
John Markoff joined The New York Times in March 1988 as a reporter for the business section. He is now a technology writer based in San Francisco bureau of the paper. Prior to joining the Times, he worked for The San Francisco Examiner from 1985 to 1988. He reported for the New York Times Science Section from 2010 to 2015.
Markoff has written about technology and science since 1977. He covered technology and the defense industry for The Pacific News Service in San Francisco from 1977 to 1981; he was a reporter at Infoworld from 1981 to 1983; he was the West Coast editor for Byte Magazine from 1984 to 1985 and wrote a column on personal computers for The San Jose Mercury from 1983 to 1985.
He has also been a lecturer at the University of California at Berkeley School of Journalism and an adjunct faculty member of the Stanford Graduate Program on Journalism.
The Times nominated him for a Pulitzer Prize in 1995, 1998 and 2000. The San Francisco Examiner nominated him for a Pulitzer in 1987. In 2005, with a group of Times reporters, he received the Loeb Award for business journalism. In 2007 he shared the Society of American Business Editors and Writers Breaking News award. In 2013 he was awarded a Pulitzer Prize in explanatory reporting as part of a New York Times project on labor and automation.
In 2007 he became a member of the International Media Council at the World Economic Forum. Also in 2007, he was named a fellow of the Society of Professional Journalists, the organization’s highest honor.
In June of 2010 the New York Times presented him with the Nathaniel Nash Award, which is given annually for foreign and business reporting.
Born in Oakland, California on October 29, 1949, Markoff grew up in Palo Alto, California and graduated from Whitman College, Walla Walla, Washington, in 1971. He attended graduate school at the University of Oregon and received a masters degree in sociology in 1976.
Markoff is the co-author of “The High Cost of High Tech,” published in 1985 by Harper & Row. He wrote “Cyberpunk: Outlaws and Hackers on the Computer Frontier” with Katie Hafner, which was published in 1991 by Simon & Schuster.
In January of 1996 Hyperion published “Takedown: The Pursuit and Capture of America’s Most Wanted Computer Outlaw,” which he co-authored with Tsutomu Shimomura. “What the Dormouse Said: How the Sixties Counterculture shaped the Personal Computer Industry,” was published in 2005 by Viking Books. “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots,” was published in August of 2015 by HarperCollins Ecco.
He is currently researching a biography of Stewart Brand.
He is married to Leslie Terzian Markoff and they live in San Francisco, Calif.
MACHINES OF LOVING GRACE – The Quest for Common Ground Between Humans and Robots By John Markoff, Illustrated. 378 pp. Ecco/HarperCollins Publishers.
–Shneiderman’s “Eight Golden Rules of Interface Design”. These rules were obtained from the text Designing the User Interface by Ben Shneiderman.
– “Designing the User Interface”, 6th Edition. This is a revised edition of the highly successful textbook on Human Computer Interaction originally developed by Ben Shneiderman and Catherine Plaisant at the University of Maryland.
– Recruit Institute of Technology. Interview with Alon Halevy. ODBMS Industry Watch, Published on 2016-04-02
– Civility in the Age of Artificial Intelligence, by STEVE LOHR, technology reporter for The New York Times, ODBMS.org
– On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.
– On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.
Follow us on Twitter: @odbmsorg