ODBMS Industry Watch (http://www.odbms.org/blog): Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.

Big Data and The Great A.I. Awakening. Interview with Steve Lohr
http://www.odbms.org/blog/2016/12/big-data-and-the-great-a-i-awakening-interview-with-steve-lohr/
Mon, 19 Dec 2016

“I think we’re just beginning to grapple with implications of data as an economic asset” –Steve Lohr.

My last interview for this year is with Steve Lohr, who has covered technology, business, and economics for the New York Times for more than twenty years. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting. We discussed Big Data and how it influences the new Artificial Intelligence awakening.

Wishing you all the best for the Holiday Season and a healthy and prosperous New Year!

RVZ

Q1. Why do you think Google (TensorFlow) and Microsoft (Computational Network Toolkit) are open-sourcing their AI software?

Steve Lohr: Both Google and Microsoft are contributing their tools to expand and enlarge the AI community, which is good for the world and good for their businesses. But I also think the move is a recognition that algorithms are not where their long-term advantage lies. Data is.

Q2. What are the implications of that for both business and policy?

Steve Lohr: The companies with big data pools can have great economic power. Today, that shortlist would include Google, Microsoft, Facebook, Amazon, Apple and Baidu.
I think we’re just beginning to grapple with implications of data as an economic asset. For example, you’re seeing that now with Microsoft’s plan to buy LinkedIn, with its personal profiles and professional connections for more than 400 million people. In the evolving data economy, is that an antitrust issue of concern?

Q3. In this competing world of AI, what is more important, vast data pools, sophisticated algorithms or deep pockets?

Steve Lohr: The best answer to that question, I think, came from a recent conversation with Andrew Ng, a Stanford professor who worked at GoogleX, is co-founder of Coursera and is now chief scientist at Baidu. I asked him why Baidu, and he replied there were only a few places to go to be a leader in A.I. Superior software algorithms, he explained, may give you an advantage for months, but probably no more. Instead, Ng said, you look for companies with two things — lots of capital and lots of data. “No one can replicate your data,” he said. “It’s the defensible barrier, not algorithms.”

Q4. What is the interplay and implications of big data and artificial intelligence?

Steve Lohr: The data revolution has made the recent AI advances possible. We’ve seen big improvements in the last few years, for example, in AI tasks like speech recognition and image recognition, using neural networks and deep learning techniques. Those technologies have been around for decades, but they are getting a huge boost from the abundance of training data because of all the web image and voice data that can be tapped now.

Q5. Is data science really only a here-and-now version of AI?

Steve Lohr: No, certainly not only. But I do find that phrase a useful way to explain to most of my readers — intelligent people, but not computer scientists — the interplay between data science and AI, and to convey that the rudiments of data-driven AI are already all around us. It’s not — surely not yet — robot armies and self-driving cars as fixtures of everyday life. But it is internet search, product recommendations, targeted advertising and elements of personalized medicine, to cite a few examples.

Q6. Technology is moving beyond increasing the odds of making a sale, to being used in higher-stakes decisions like medical diagnosis, loan approvals, hiring and crime prevention. What are the societal implications of this?

Steve Lohr: The new, higher-stakes decisions that data science and AI tools are increasingly being used to make — or assist in making — are fundamentally different than marketing and advertising. In marketing and advertising, a decision that is better on average is plenty good enough. You’ve increased sales and made more money. You don’t really have to know why.
But the other decisions you mentioned are practically and ethically very different. These are crucial decisions about individual people’s lives. Better on average isn’t good enough. For these kinds of decisions, issues of accuracy, fairness and discrimination come into play.
That, I think, argues for two things. First, some sort of auditing tool; the technology has to be able to explain itself, to explain how a data-driven algorithm came to the decision or recommendation that it did.
Second, I think it argues for having a “human in the loop” for most of these kinds of decisions for the foreseeable future.

Q7. Will data analytics move into the mainstream of the economy (far beyond the well known, born-on-the-internet success stories like Google, Facebook and Amazon)?

Steve Lohr: Yes, and I think we’re seeing that now in nearly every field — health care, agriculture, transportation, energy and others. That said, it is still very early. It is a phenomenon that will play out for years, and decades.
Recently, I talked to Jeffrey Immelt, the chief executive of General Electric, America’s largest industrial company. GE is investing heavily to put data-generating sensors on its jet engines, power turbines, medical equipment and other machines — and to hire software engineers and data scientists.
Immelt said that if you go back more than a century to the company’s origins in Thomas Edison‘s day, GE’s technical foundation has been materials science and physics. Data analytics, he said, will be the third fundamental technology for GE in the future.
I think that’s a pretty telling sign of where things are headed.

—————————–
Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years and writes for the Times’ Bits blog. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting.
He was a foreign correspondent for a decade and served as an editor, and has written for national publications such as the New York Times Magazine, the Atlantic, and the Washington Monthly. He is the author of Go To: The Story of the Math Majors, Bridge Players, Engineers, Chess Wizards, Maverick Scientists, Iconoclasts—the Programmers Who Created the Software Revolution and of Data-ism: The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else.
He lives in New York City.

————————–

Resources

Google (TensorFlow): TensorFlow™ is an open source software library for numerical computation using data flow graphs.

Microsoft (Computational Network Toolkit): A free, easy-to-use, open-source, commercial-grade toolkit that trains deep learning algorithms to learn like the human brain.

Data-ism: The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else, by Steve Lohr. HarperCollins Publishers, 2016.

Related Posts

Don’t Fear the Robots. By STEVE LOHR. Oct. 24, 2015. The New York Times, Sunday Review | News Analysis

G.E., the 124-Year-Old Software Start-Up. By STEVE LOHR. Aug. 27, 2016. The New York Times, Technology

Machines of Loving Grace. Interview with John Markoff. ODBMS Industry Watch, Published on 2016-08-11

Recruit Institute of Technology. Interview with Alon Halevy. ODBMS Industry Watch, Published on 2016-04-02

Civility in the Age of Artificial Intelligence, by STEVE LOHR, technology reporter for The New York Times, ODBMS.org

On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.

Follow us on Twitter: @odbmsorg

##

How the 11.5 million Panama Papers were analysed. Interview with Mar Cabra
http://www.odbms.org/blog/2016/10/how-the-11-5-million-panama-papers-were-analysed-interview-with-mar-cabra/
Tue, 11 Oct 2016

“The best way to explore all The Panama Papers data was using graph database technology, because it’s all relationships, people connected to each other or people connected to companies.” –Mar Cabra.

I have interviewed Mar Cabra, head of the Data & Research Unit of the International Consortium of Investigative Journalists (ICIJ). The main subject of the interview is how the 11.5 million Panama Papers were analysed.

RVZ

Q1. What is the mission of the International Consortium of Investigative Journalists (ICIJ)?

Mar Cabra: Founded in 1997, the ICIJ is a global network of more than 190 independent journalists in more than 65 countries who collaborate on breaking big investigative stories of global social interest.

Q2. What is your role at ICIJ?

Mar Cabra: I am the Editor at the Data and Research Unit – the desk at the ICIJ that deals with data, analysis and processing, as well as supporting the technology we use for our projects.

Q3. The Panama Papers investigation was based on a 2.6 Terabyte trove of data obtained by Süddeutsche Zeitung and shared with ICIJ and a network of more than 100 media organisations. What was your role in this data investigation?

Mar Cabra: I co-ordinated the work of the team of developers and journalists that first got the leak from Süddeutsche Zeitung, then processed it to make it available online, through secure platforms, to more than 370 journalists.
I also supervised the data analysis that my team did to enhance and focus the stories. My team was also in charge of the interactive product that we produced for the publication stage of The Panama Papers, so we built an interactive visual application called the ‘Powerplayers’ where we detailed the main stories of the politicians with connections to the offshore world. We also released a game explaining how the offshore world works! Finally, in early May, we updated the offshore database with information about the Panama Papers companies, the 200,000-plus companies connected with Mossack Fonseca.

Q4. The leaked dataset comprises 11.5 million files from the Panamanian law firm Mossack Fonseca. How was all this data analyzed?

Mar Cabra: We relied on Open Source technology and processes that we had worked on in previous projects to process the data. We used Apache Tika to process the documents and also to access them, and created a processing chain of 30 to 40 machines in Amazon Web Services that would process those documents in parallel and then index them into a document search platform that could be used by hundreds of journalists from anywhere in the world.
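To make that concrete, here is a minimal, hypothetical sketch of that kind of pipeline in Python: extract text and metadata with Apache Tika, process files in parallel, and index the results into a Solr-style document search platform. The Solr URL, field names and file paths are assumptions for illustration, not ICIJ's actual configuration (their production pipeline ran across a fleet of AWS machines).

```python
# Illustrative sketch only: extract text with Apache Tika and index it into Solr.
# The Solr URL, collection and field names are assumptions, not ICIJ's setup.
from concurrent.futures import ThreadPoolExecutor
from tika import parser          # pip install tika (needs a JVM / Tika server)
import pysolr                    # pip install pysolr

solr = pysolr.Solr("http://localhost:8983/solr/documents", timeout=30)

def process(path):
    """Extract text and metadata from one file via Tika."""
    parsed = parser.from_file(path)
    return {
        "id": path,
        "content": (parsed.get("content") or "").strip(),
        "metadata": str(parsed.get("metadata", {})),
    }

def index_documents(paths, workers=8):
    """Process files in parallel and push the results to the search index."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        docs = [d for d in pool.map(process, paths) if d["content"]]
    solr.add(docs)
    solr.commit()

# index_documents(["/leak/doc-001.pdf", "/leak/email-042.msg"])  # hypothetical paths
```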

Q5. Why did you decide to use a graph-based approach for that?

Mar Cabra: Inside the 11.5 million files in the original dataset given to us, there were more than 3 million that came from Mossack Fonseca’s internal database, which basically contained names of companies in offshore jurisdictions and the people behind them. In other words, that’s a graph! The best way to explore all The Panama Papers data was using graph database technology, because it’s all relationships, people connected to each other or people connected to companies.

Q6. What were the main technical challenges you encountered in analysing such a large dataset?

Mar Cabra: We had already used all of the tools from this investigation in previous projects. The main issue here was dealing with many more files in many more formats. So the main challenge was how to make all those files readable in a fast way, when many of them were images.
Our next problem was how we could make them understandable to journalists who are not tech savvy. Again, that’s where a graph database became very handy, because you don’t need to be a data scientist to work with a graph representation of a dataset: you just see dots on a screen, nodes, and then click on them and find the connections – like that, very easily, and without having to hand-code or build queries. I should say you can build queries using Cypher if you want, but you don’t have to.
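As an illustration of the kind of query this enables, here is a hedged sketch using the Neo4j Python driver. The node labels, relationship types and property names are assumptions made for the example, not necessarily the schema ICIJ used.

```python
# Minimal sketch of a Cypher query a reporter or developer might run.
# Labels, relationship types and properties are invented for illustration.
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (p:Person {name: $name})-[r]->(c:Company)
RETURN p.name AS person, type(r) AS relationship, c.name AS company
LIMIT 25
"""

def connections_for(name):
    """Return the companies directly connected to a person of interest."""
    with driver.session() as session:
        return [record.data() for record in session.run(QUERY, name=name)]

# for row in connections_for("John Doe"):        # hypothetical name
#     print(row["person"], "->", row["relationship"], "->", row["company"])
```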

Q7. What are the similarities with the way you analysed data in the Swiss Leaks story (exposing the fraudulent activity of 100,000 HSBC private bank clients in Switzerland)?

Mar Cabra: We used the same tools for that – a document search platform and a graph database – and we used them in combination to find stories. The baseline was the same but the complexity was 100 times more for the Panama Papers. So the technology is the same in principle, but because we were dealing with many more documents, much more complex data, in many more formats, we had to make a lot of improvements in the tools so they really worked for this project. For example, we had to improve the document search platform with a batch search feature, where journalists would upload a list of names and get back a list of links to the documents in which those names had hits.
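A batch search of that sort can be approximated in a few lines against a Solr index. The sketch below is illustrative only; the index URL and field names are assumptions rather than ICIJ's actual setup.

```python
# Hedged sketch of a batch-search helper: take a list of names and return which
# documents in the index mention each one.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/documents", timeout=30)

def batch_search(names, rows_per_name=5):
    """For each name, return the ids of documents where it appears."""
    hits = {}
    for name in names:
        # Quote the name so multi-word names are searched as a phrase.
        results = solr.search('content:"%s"' % name, rows=rows_per_name)
        hits[name] = [doc["id"] for doc in results]
    return hits

# report = batch_search(["Jane Smith", "Acme Offshore Ltd"])  # hypothetical names
```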

Q8. Emil Eifrem, CEO of Neo Technology, wrote: “If the Panama Papers leak had happened ten years ago, no story would have been written because no one else would have had the technology and skillset to make sense of such a massive dataset at this scale.” What is your take on this?

Mar Cabra: We would have done the Panama Papers differently, probably printing the documents – and that would have had a tremendous effect on the paper supplies of the world, because printing out all 11.5 million files would have been crazy! We would have published some stories and the public might have seen some names on the front page of a few newspapers, but the scale and the depth and the understanding of this complex world would not have been able to happen without access to the technology we have today. We would just have not been able to do such an in-depth investigation at a global scale without the technology we have access to now.

Q9. Whistleblowers take incredible risks to help you tell data stories. Why do they do it?

Mar Cabra: Occasionally, some whistleblowers have a grudge and are motivated in more personal terms. Many have been what we call in Spanish ‘widows of power’: people who have been in power and have lost it, and those who wish to expose the competition or have a grudge. Motivations of whistleblowers vary, but I think there is always an intention to expose injustice. ‘John Doe’ is the source behind the Panama Papers, and a few weeks after we published, he explained his motivation: he wanted to expose an unjust system.

————————–
Mar Cabra is the head of ICIJ’s Data & Research Unit, which produces the organization’s key data work and also develops tools for better collaborative investigative journalism. She has been an ICIJ staff member since 2011, and is also a member of the network.

Mar fell in love with data while she was a Fulbright scholar and fellow at the Stabile Center for Investigative Journalism at Columbia University in 2009/2010. Since then, she’s promoted data journalism in her native Spain, co-creating the first-ever master’s degree in investigative reporting, data journalism and visualisation, as well as the national data journalism conference, which gathers more than 500 people every year.

She previously worked in television (BBC, CNN+ and laSexta Noticias) and her work has been featured in the International Herald Tribune, The Huffington Post, PBS, El País, El Mundo and El Confidencial, among others.
In 2012 she received the Spanish Larra Award, given to the country’s most promising journalist under 30. (PGP public key)

Resources

– Panama Papers Source Offers Documents To Governments, Hints At More To Come. International Consortium of Investigative Journalists. May 6, 2016

The Panama Papers. ICIJ

– The two journalists from Süddeutsche Zeitung: Frederik Obermaier and Bastian Obermayer

– Offshore Leaks Database: Released in June 2013, the Offshore Leaks Database is a simple search box.

Open Source used for analysing the #PanamaPapers:

– Oxwall: We found an open source social network tool called Oxwall that we tweaked to our advantage. We basically created a private social network for our reporters.

– Apache Tika and Tesseract to do optical character recognition (OCR).

– We created a small program ourselves, called Extract, which is in our GitHub account and allowed us to do this parallel processing. Extract would get a file and try to see if it could recognize the content. If it couldn’t, we would do OCR and then send it to our document search platform, which was Apache Solr. (A minimal sketch of this detect-then-OCR fallback appears after this list.)

– Based on Apache Solr, we created an index, and then we used Project Blacklight, another open source tool that was originally used for libraries, as our front-end tool. For example, Columbia University Library, where I studied, used this tool.

– Linkurious: Linkurious is software that allows you to visualize graphs very easily. You get a license, you put it in your server, and if you have a database in Neo4j you just plug it in and within hours you have the system set up. It also has this private system where our reporters can login or logout.

– Thanks to another open source tool – in this case Talend, an extract, transform and load (ETL) tool – we were able to easily transform our database into Neo4j, plug in Linkurious and get reporters searching.

Neo4j: Neo4j is a highly scalable, native graph database purpose-built to leverage not only data but also its relationships. Neo4j’s native graph storage and processing engine deliver constant, real-time performance, helping enterprises build intelligent applications to meet today’s evolving data challenges.

– The good thing about Linkurious is that reporters, or the developers at the other end of the spectrum, can also make highly technical Cypher queries if they want to start looking more in depth at the data.
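The following is a minimal sketch of the detect-then-OCR fallback described in the Extract entry above. It is not ICIJ's Extract tool; it simply illustrates the logic with Apache Tika and Tesseract (via pytesseract), with file-type handling deliberately simplified and the sample path invented.

```python
# Minimal sketch: try normal text extraction first, fall back to OCR for
# image-only files, then hand the text to the indexer (Solr, as above).
from tika import parser        # pip install tika
from PIL import Image          # pip install pillow
import pytesseract             # pip install pytesseract (needs the tesseract binary)

def extract_text(path):
    """Return text for a file, using OCR only when Tika finds no content."""
    parsed = parser.from_file(path)
    text = (parsed.get("content") or "").strip()
    if text:
        return text
    # No recognizable text layer: treat the file as an image and OCR it.
    return pytesseract.image_to_string(Image.open(path))

# text = extract_text("/leak/scanned-certificate.tif")   # hypothetical path
# The result would then be indexed into Apache Solr as in the earlier sketch.
```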

Related Posts

##

On Silos, Data Integration and Data Security. Interview with David Gorbet
http://www.odbms.org/blog/2016/09/on-silos-data-integration-and-data-security-interview-with-david-gorbet/
Fri, 23 Sep 2016

“Data integration isn’t just about moving data from one place to another. It’s about building an actionable, operational view on data that comes from multiple sources so you can integrate the combined data into your operations rather than just looking at it later as you would in a typical warehouse project.” — David Gorbet.

I have interviewed David Gorbet, Senior Vice President, Engineering at MarkLogic. We cover several topics in the interview: silos, data integration, data quality, security, and the new features of MarkLogic 9.

RVZ

Q1. Data integration is the number one challenge for many organisations. Why?

David Gorbet: There are three ways to look at that question. First, why do organizations have so many data silos? Second, what’s the motivation to integrate these silos, and third, why is this so hard?

Our Product EVP, Joe Pasqua, did an excellent presentation on the first question at this year’s MarkLogic World. The spoiler is that silos are a natural and inevitable result of an organization’s success. As companies become more successful, they start to grow. As they grow, they need to partition in order to scale. To function, these partitions need to run somewhat autonomously, which inevitably creates silos.
Another way silos enter the picture is what I call “application accretion” or less charitably, “crusty application buildup.” Companies merge, and now they have two HR systems. Divisions acquire special-purpose applications and now they have data that exists only in those applications. IT projects are successful and now need to add capabilities, but it’s easier to bolt them on and move data back and forth than to design them into an existing IT system.

Two years ago I proposed a data-centric view of the world versus an application-centric view. If you think about it, most organizations have a relatively small number of “things” that they care deeply about, but a very large number of “activities” they do with these “things.”
For example, most organizations have customers, but customer-related activities happen all across the organization.
Sales is selling to them. Marketing is messaging to them. Support is helping solve their problems. Finance is billing them. And so on… All these activities are designed to be independent because they take place in organizational silos, and the data silos just reflect that. But the data is all about customers, and each of these activities would benefit greatly from information generated by and maintained in the other silos. Imagine if Marketing could know what customers use the product for to tailor the message, or if Sales knew that the customer was having an issue with the product and was engaged with Support? Sometimes dealing with large organizations feels like dealing with a crazy person with multiple personalities. Organizations that can integrate this data can give their customers a much better, saner experience.

And it’s not just customers. Maybe it’s trades for a financial institution, or chemical compounds for a pharmaceutical company, or adverse events for a life sciences company, or “entities of interest” for an intelligence or police organization. Getting a true, 360-degree view of these things can make a huge difference for these organizations.
In some cases, like with one customer I spoke about in my most recent MarkLogic World keynote who looks at the environment of potentially at-risk children, it can literally mean the difference between life and death.

So why is this so hard? Because most technologies require you to create data models that can accommodate everything you need to know about all of your data in advance, before you can even start the data integration project. They also require you to know the types of queries you’re going to do on that data so you can design efficient schemas and indexing schemes.
This is true even of some NoSQL technologies that require you to figure out sharding and compound indexing schemes in advance of loading your data. As I demonstrated in that keynote I mentioned, even if you have a relatively small set of entities that are quite simple, this is incredibly hard to do.
Usually it’s so hard that instead organizations decide to do a subset of the integration to solve a specific need or answer a specific question. Sadly, this tends to create yet another silo.

Q2. Integrate data from silos: how is it possible?

David Gorbet: Data integration isn’t just about moving data from one place to another. It’s about building an actionable, operational view on data that comes from multiple sources so you can integrate the combined data into your operations rather than just looking at it later as you would in a typical warehouse project.

How do you do that? You build an operational data hub that can consume data from multiple sources and expose APIs on that data so that downstream consumers, either applications or other systems, can consume it in real time. To do this you need an infrastructure that can accommodate the variability across silos naturally, without a lot of up-front data modeling, and without each silo having a ripple effect on all the others.
For the engineers out there (like me), think of this as trying to turn an O(n²) problem into an O(n) problem.
As the number of silos increases, most projects get exponentially more complex, since you can only have one schema and every new silo impacts that schema, which is shared by all data across all existing silos. You want a technology where adding a new data silo does not require re-doing all the work you’ve already done. In addition, you need a flexible technology that allows a flexible data model that can adapt to change. Change in both what data is used and in how it’s used. A system that can evolve with the evolving needs of the business.
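To put rough numbers on that point, the sketch below counts pairwise silo-to-silo integrations (which grow quadratically) against hub connections (which grow linearly). The figures are illustrative only.

```python
# Back-of-the-envelope numbers behind the O(n^2) vs O(n) point.
def point_to_point(n):
    return n * (n - 1) // 2   # every silo reconciled against every other silo

def hub_and_spoke(n):
    return n                  # every silo mapped once, to the hub

for n in (5, 10, 20, 50):
    print(n, "silos:", point_to_point(n), "pairwise links vs", hub_and_spoke(n), "hub links")
# 50 silos: 1225 pairwise links vs 50 hub links
```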

MarkLogic can do this because it can ingest data with multiple different schemas and index and query it together.
You don’t have to create one schema that can accommodate all your data. Our built-in application services allow our customers to build APIs that expose the data directly from their data hub, and with ACID transactions these APIs can be used to build real operational applications.
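As a rough illustration of that idea, here is a hedged sketch against MarkLogic's REST Client API: two differently shaped customer records are loaded as-is and then searched together. The host, port, credentials, document URIs and collection name are assumptions about a local developer setup, and error handling is minimal.

```python
# Hedged sketch: load two differently shaped records and search across both,
# without first forcing them into one shared schema. Connection details assumed.
import requests
from requests.auth import HTTPDigestAuth

AUTH = HTTPDigestAuth("admin", "admin")
BASE = "http://localhost:8000/v1"

def load(uri, doc):
    """Insert a JSON document as-is into the 'customers' collection."""
    r = requests.put(f"{BASE}/documents", params={"uri": uri, "collection": "customers"},
                     json=doc, auth=AUTH)
    r.raise_for_status()

load("/crm/cust-1.json", {"name": "Acme Corp", "sales": {"owner": "J. Doe", "stage": "won"}})
load("/support/cust-1.json", {"customer": "Acme Corp", "tickets": [{"id": 42, "status": "open"}]})

# One search over both shapes of record.
hits = requests.get(f"{BASE}/search",
                    params={"q": "Acme", "collection": "customers", "format": "json"},
                    auth=AUTH)
print(hits.json()["total"])
```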

Q3. What is the problem with traditional solutions like relational databases, Extract Transform and Load (ETL) tools?

David Gorbet: To use a metaphor, most technology used for this type of project is like concrete. Now concrete is incredibly versatile. You can make anything you want out of concrete: a bench, a statue, a building, a bridge… But once you’ve made it, you’d better like it because if you want to change it you have to get out the jackhammer.

Many projects that use these tools start out with lofty goals, and they spend a lot of time upfront modeling data and designing schemas. Very quickly they realize that they are not going to be able to make that magical data model that can accommodate everything and be efficiently queried. They start to cut corners to make their problem more tractable, or they design flexible but overly generic models like tall thin tables that are inefficient to query. Every corner they cut limits the types of applications they can then build on the resulting integrated data, and inevitably they end up needing some data they left behind, or needing to execute a query they hadn’t planned (and built an index) for.

Usually at some point they decide to change the model from a hub-and-spoke data integration model to a point-to-point model, because point-to-point integrations are much easier. That, or it evolves as new requirements emerge, and it becomes impossible to keep up by jackhammering the system and starting over. But this just pushes the complexity out of these now point-to-point flows and into the overall system architecture. It also causes huge governance problems, since data now flows in lots of directions and is transformed in many ways that are generally pretty opaque and hard to trace. The inability to capture and query metadata about these data flows causes master-data problems and governance problems, to the point where some organizations genuinely have no idea where potentially sensitive data is being used. The overall system complexity also makes it hard to scale and expensive to operate.

Q4. What are the typical challenges of handling both structured, and unstructured data?

David Gorbet: It’s hard enough to integrate structured data from multiple silos. Everything I’ve already talked about applies even if you have purely structured data. But when some of your data is unstructured, or has a complex, variable structure, it’s much harder. A lot of data has a mix of structured data and unstructured text. Medical records, journal articles, contracts, emails, tweets, specifications, product catalogs, etc. The traditional solution to textual data in a relational world is to put it in an opaque BLOB or CLOB, and then surface its content via a search technology that can crawl the data and build indexes on it. This approach suffers from several problems.

First, it involves stitching together multiple different technologies, each of which has its own operational and governance characteristics. They don’t scale the same way. They don’t have the same security model (unless they have no security model, which is actually pretty common). They don’t have the same availability characteristics or disaster recovery model.
They don’t backup consistently with each other. The indexes are separate, so they can’t be queried together, and keeping them in sync so that they’re consistent is difficult or impossible.

Second, more and more text is being mined for structure. There are technologies that can identify people, places, things, events, etc. in freeform text and structure it. Sentiment analysis is being done to add metadata to text. So it’s no longer accurate to think of text as islands of unstructured data inside a structured record. It’s more like text and structure are inter-mixed at all levels of granularity. The resulting structure is by its nature fluid, and therefore incompatible with the up-front modeling required by relational technology.
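As a small, hypothetical illustration of mining structure from free text (independent of any particular database), a named-entity recognizer such as spaCy can turn a sentence into attributes that could be indexed alongside the original text. The sentence and model choice below are assumptions for the example.

```python
# Hedged illustration: named-entity recognition turns an unstructured sentence
# into queryable attributes. Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp hired Jane Smith in Seattle on 12 March 2007.")

# Each entity comes back with a label (ORG, PERSON, GPE, DATE, ...) that can be
# stored and indexed alongside the raw text.
structured = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
print(structured)
```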

Third, search engines don’t index structure unless you tell them to, which essentially involves explaining the “schema” of the text to them so that they can build facets and provide structured search capabilities. So even in your “unstructured” technology, you’re often dealing with schema design.

Finally, as powerful as it is, search technology doesn’t know anything about the semantics of the data. Semantic search enables a much richer search and discovery experience. Look for example at the info box to the right of your Google results. This is provided by Google’s knowledge graph, a graph of data using Semantic Web technologies. If you want to provide this kind of experience, where the system can understand concepts and expand or narrow the context of the search accordingly, you need yet another technology to manage the knowledge graph.
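To make the knowledge-graph idea concrete, here is a tiny, invented example using RDF triples and a SPARQL property-path query with rdflib. The vocabulary and data are made up for illustration and have nothing to do with Google's actual knowledge graph.

```python
# Hedged sketch: a three-triple "knowledge graph" and a query that broadens the
# search context by following the concept hierarchy.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Jaguar, EX.isA, EX.Animal))
g.add((EX.JaguarCars, EX.isA, EX.CarMaker))
g.add((EX.Animal, EX.broader, EX.LivingThing))

# "Which things are, directly or transitively, living things?"
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?thing WHERE { ?thing ex:isA/ex:broader* ex:LivingThing . }
""")
for row in results:
    print(row.thing)   # matches ex:Jaguar (the animal), not ex:JaguarCars
```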

Two years ago at my MarkLogic World keynote I said that search is the query language for unstructured data, so if you have a mix of structured and unstructured data, you need to be able to search and query together. MarkLogic lets you mix structured and unstructured search, as well as semantic search, all in one query, resolved in one technology.

Q5. An important aspect when analysing data is Data Quality. How do you evaluate if the data is of good or of bad quality?

David Gorbet: Data quality is tough, particularly when you’re bringing data together from multiple silos. Traditional technologies require you to transform the data from one schema into another in order to move it from place to place. Every transformation leaves some data behind, and every one has the potential to be a point of data loss or data corruption if the transformation isn’t perfect. In addition, the lineage of the data is often lost. Where did this attribute of this entity come from? When was it extracted? What was the transform that was run on it? What did it look like before?
All of this is lost in the ETL process. The best way to ensure data quality is to always bring along with each record the original, untransformed data, as well as metadata tracing its provenance, lineage and context.
MarkLogic lets you do this, because our flexible schema accommodates source data, canonicalized (transformed) data, and metadata all in the same record, and all of it is queryable together. So if you find a bug in your transform, it’s easy to query for all impacted records, and because you have the source data there, you can easily fix it as well.
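As an illustration of that record layout, here is a hypothetical "envelope"-style record that keeps the raw source, the canonical view and lineage metadata side by side. The field names are assumptions for the example, not a schema MarkLogic prescribes.

```python
# Illustrative only: one record carrying source data, the transformed view, and
# provenance metadata together, so all three can be queried side by side.
customer_record = {
    "envelope": {
        "source": {                      # untouched payload from the silo of origin
            "CUST_NM": "ACME CORP", "REGION_CD": "07",
        },
        "canonical": {                   # harmonized view used by downstream APIs
            "name": "Acme Corp", "region": "Northeast",
        },
        "metadata": {                    # lineage: where it came from, how, and when
            "source_system": "legacy-crm",
            "transform": "crm-to-canonical-v3",
            "ingested_at": "2016-09-01T12:00:00Z",
        },
    }
}
```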

In addition, our Bitemporal feature can trace changes to a record over time, and let you query your data as it is, as it was, or as you thought it was at any given point in time or over any historical (or in some cases future) time range. So you have traceability when your data changes, and you can understand how and why it has changed.

Q6. Data leakage is another problem for many corporations that experienced high profile security incidents. What can be done to solve this problem?

David Gorbet: Security is another important aspect of data governance. And security isn’t just about locking all your data in a vault and only letting some people look at it. Security is more granular than that. There are some data that can be seen by just about anyone in your organization. Some that should only be seen by people who need it, and some that should be hidden from all but people with specific roles. In some cases, even users with a particular role should not see data unless they have a provable need in addition to the role required. This is called “compartment security,” meaning you have to be in a certain compartment to see data, regardless of your role or clearance overall.

There is a principle in security called “defense in depth.” Basically it means pushing the security to the lowest layer possible in the stack. That’s why it’s critically important that your DBMS have strong and granular security features.
This is especially true if you’re integrating data from silos, each of which may have its own security rules.
You need your integrated data hub to be able to observe and enforce those rules, regardless of how complex they are.

Increasingly the concern is over the so-called “insider threat.” This is the employee, contractor, vendor, managed service provider, or cloud provider who has access to your infrastructure. That’s another good reason not to implement security in your application: if you do, any DBA will be able to circumvent it. Today, with the move to cloud and other outsourced infrastructure, organizations are also concerned about what’s on the file system. Even if you secure your data at the DBMS layer, a system administrator with file system access can still get at it. To counter this, more organizations are requiring “at rest” encryption of data, which means that the data is encrypted on the file system. A good implementation will require a separate role to manage encryption keys, different from the DBA or SA roles, along with a separate key management technology. In our implementation, MarkLogic never even sees the database encryption keys, relying instead on a separate key management system (KMS) to unlock data for us. This separation of concerns is a lot more secure, because it would require insiders to collude across functions and organizations to steal data. You can even keep your data in the cloud and your keys on-premises, or with another managed service provider.

Q8. What is new in the MarkLogic® 9 database?

David Gorbet: There’s so much in MarkLogic 9 it’s hard to cover all of it. That presentation I referenced earlier from Joe does a pretty good job of summarizing the features. Many of the features in MarkLogic 9 are designed to make data integration even easier. MarkLogic 9 has new ways of modeling data that can keep it in its flexible document form, but project it into tabular form for more traditional analysis (aggregates, group-bys, joins, etc.) using either SQL or a NoSQL API we call the Optic API. This allows you to define the structured parts of your data and let MarkLogic index it in a way that makes it most efficient to query and aggregate.
You can also use this technique to extract RDF triples from your data, giving you easy access to the full power of Semantics technologies.
We’re doing more to make it easier to get data into MarkLogic via a new data movement SDK that you can hook directly up to your data pipeline. This SDK can help orchestrate transformations and parallel loads of data no matter where it comes from.

We’re also doubling down on security. Earlier I mentioned encryption at rest. That’s a new feature for MarkLogic 9.
We’re also doing sub-record-level role- and compartment-based access control. This means that if you have a record (like a customer record) that you want to make broadly available, but there is some data in that record (like a SSN) that you want to restrict access to, you can easily do that. You can also obfuscate and transform data within a record to redact it for export or for use in a context that is less secure than MarkLogic.

Security is a governance feature, and we’re improving other governance features as well, with policy-based tiering for lifecycle management, and improvements to our Bitemporal feature that make it a full-fledged compliance feature.
We’re introducing new tools to help monitor and manage multiple clusters at a time. And we’re making many other improvements in many other areas, like our new geospatial region index that makes region-region queries much faster, improvements to tools like Query Console and MLCP, and many, many more.

One exciting feature that is a bit hard to understand at first is our new Entity Services feature. You can think of this as a catalog of entities. You can put whatever you want in this catalog. Entity attributes, relationships, etc. but also policies, governance rules, and other entity class metadata. This is a queryable semantic model, so you can query your catalog at runtime in your application. We’ll also be providing tools that use this catalog to help build the right set of indexes, indexing templates, APIs, etc. for your specific data. Over time, Entity Services will become the foundation of our vision of the “smart database.” You’ll hear us start talking a lot more about that soon.

—————–

David Gorbet, Senior Vice President, Engineering, MarkLogic.

David Gorbet has the best job in the world. As SVP of Engineering, David manages the team that delivers the MarkLogic product and supports our customers as they use it to power their amazing applications. Working with all those smart, talented engineers as they pour their passion into our product is a humbling experience, and seeing the creativity and vision of our customers and how they’re using our product to change their industry is simply awesome.

Prior to MarkLogic, David helped pioneer Microsoft’s business online services strategy by founding and leading the SharePoint Online team. In addition to SharePoint Online, David has held a number of positions at Microsoft and elsewhere, working on enterprise server products, applications, and numerous incubation products.

David holds a Bachelor of Applied Science Degree in Systems Design Engineering with an additional major in Psychology from the University of Waterloo, and an MBA from the University of Washington Foster School of Business.

Resources

Join the Early Access program for a MarkLogic 9 introduction by visiting: ea.marklogic.com

– The MarkLogic Developer License is free to all who sign up and join the MarkLogic developer community.

Related Posts

– On Data Governance. Interview with David Saul. ODBMS Industry Watch,  2016-07-23

– On Data Interoperability. Interview with Julie Lockner. ODBMS Industry Watch, 2016-06-07

– On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, 2016-05-24

Follow us on Twitter: @odbmsorg

##

On Data Analytics and the Enterprise. Interview with Narendra Mulani
http://www.odbms.org/blog/2016/05/on-data-analytics-and-the-enterprise-interview-with-narendra-mulani/
Tue, 24 May 2016

“A hybrid technology infrastructure that combines existing analytics architecture with new big data technologies can help companies to achieve superior outcomes.”–Narendra Mulani

I have interviewed Narendra Mulani, Chief Analytics Officer, Accenture Analytics. The main topics of our interview are: Data Analytics, Big Data, the Internet of Things, and their repercussions for the enterprise.

RVZ

Q1. What is your role at Accenture?

Narendra Mulani: I’m the Chief Analytics Officer at Accenture Analytics and I am responsible for building and inspiring a culture of analytics and driving Accenture’s strategic agenda for growth across the business. I lead a team of analytics professionals around the globe that are dedicated to helping clients transform into insight-driven enterprises and focused on creating value through innovative solutions that combine industry and functional knowledge with analytics and technology.

With the constantly increasing amount of data and new technologies becoming available, it truly is an exciting time for Accenture and our clients alike. I’m thrilled to be collaborating with my team and clients and taking part, first-hand, in the power of analytics and the positive disruption it is creating for businesses around the globe.

Q2. What are the main drivers you see in the market for Big Data Analytics?

Narendra Mulani: Companies across industries are fighting to secure or keep their lead in the marketplace.
To excel in this competitive environment, they are looking to exploit one of their growing assets: Data.
Organizations see big data as a catalyst for their transformation into digital enterprises and as a way to secure an insight-driven competitive advantage. In particular, big data technologies are giving companies greater agility, helping them analyze data comprehensively and take more informed actions at a swifter pace. We’ve already passed the transition point with big data – instead of discussing the possibilities with big data, many are already experiencing the actual insight-driven benefits from it, including increased revenues, a larger base of loyal customers, and more efficient operations. In fact, we see our clients looking for granular solutions that leverage big data, advanced analytics and the cloud to address industry specific problems.

Q3. Analytics and Mobility: how do they correlate?

Narendra Mulani: Analytics and mobility are two digital areas that work hand-in-hand on many levels.
As an example, mobile devices and the increasingly connected world through the Internet of Things (IoT) have become two key drivers for big data analytics. As mobile devices, sensors, and the IoT are constantly creating new data sources and data types, big data analytics is being applied to transform the increasing amount of data into important and actionable insight that can create new business opportunities and outcomes. Also, this view can be reversed, where analytics feeds insight into mobile devices such as tablets to workers in offices or out in the field to enable them to make real-time decisions that could benefit their business.

Q4. Data explosion: What does it create ? Risks, Value or both?

Narendra Mulani: The data explosion that’s happening today and will continue to happen due to the Internet of Things creates a lot of opportunity for businesses. While organizations recognize the value that the data can generate, the sheer amount of data – internal data, external data, big data, small data, etc – can be overwhelming and create an obstacle for analytics adoption, project completion, and innovation. To overcome this challenge and pursue actionable insights and outcomes, organizations shouldn’t look to analyze all of the data that’s available, but identify the right data needed to solve the current project or challenge at hand to create value.

It’s also important for companies to manage the potential risk associated with the influx of data and take the steps needed to optimize and protect it. They can do this by aligning IT and business leads to jointly develop and maintain data governance and security strategies. At a high level, the strategies would govern who uses the data and how the data is analyzed and leveraged, define the technologies that would manage and analyze the data, and ensure the data is secured with the necessary standards. Suitable governance and security strategies should be requirements for insight-driven businesses. Without them, organizations could experience adverse and counter-productive results.

Q5. You introduced the concept of the “Modern Data Supply Chain”? How does it differ from the traditional Supply Chain?

Narendra Mulani: As companies’ data ecosystems are usually very complex with many data silos, a modern data supply chain helps them to simplify their data environment and generate the most value from their data. In brief, when data is treated as a supply chain, it can flow swiftly, easily and usefully through the entire organization— and also through its ecosystem of partners, including customers and suppliers.

To establish an effective modern data supply chain, companies should create a hybrid technology environment that enables a data service platform with emerging big data technologies. As a result, businesses will be able to access, manage, move, mobilize and interact with broader and deeper data sets across the organization at a much quicker pace than previously possible, and act on the resulting insights to deliver more effectively to their customers, develop new innovative solutions, and differentiate in their markets.

Q6. You talked about “Retooling the Enterprise”. What do you mean by this?

Narendra Mulani: Some businesses today are no longer just using analytics, they are taking the next step by transforming into insight-driven enterprises. To achieve “insight-driven enterprise” status, organizations need to retool themselves for optimization. They can pursue an insight-driven transformation by:

· Establishing a center of gravity for analytics – a center of gravity for analytics often takes the shape of a Center of Excellence or a similar concentration of talent and resources.
· Employing agile governance – build horizontal governance structures that are focused on outcomes and speed to value, and take a “test and learn” approach to rolling out new capabilities. A secure governance foundation could also improve the democratization of data throughout a business.
· Creating an inter-disciplinary high performing analytics team — field teams with diverse skills, organize talent effectively, and create innovative programs to keep the best talent engaged.
· Deploying new capabilities faster – deploy new, modern and agile technologies, as well as hybrid architectures and specifically designed toolsets, to help revolutionize how data has been traditionally managed, curated and consumed, to achieve speed to capability and desired outcomes. When appropriate, cloud technologies should be integrated into the IT mix to benefit from cloud-based usage models.
· Raising the company’s analytics IQ – have a vision of what would be your “intelligent enterprise” and implement an Analytics Academy that provides analytics training for functional business resources in addition to the core management training programs.

Q7. What are the risks from the Internet of Things? And how is it possible to handle such risks?

Narendra Mulani: The IoT is prompting an even greater focus on data security and privacy. As a company’s machines, employees and ecosystems of partners, providers, and customers become connected through the IoT, securing the data that is flowing across the IoT grid can be increasingly complex. Today’s sophisticated cyber attackers are also amplifying this complexity as they are constantly evolving and leveraging data technology to challenge a company’s security efforts.

To establish a strong, effective real-time cyber defense strategy, security teams will need to employ innovative technologies to identify threat behavioral patterns — including artificial intelligence, automation, visualisation, and big data analytics — and an agile and fluid workforce to leverage the opportunities presented by technology innovations. They should also establish policies to address privacy issues that arise out of all the personal data that are being collected. Through this combination of efforts, companies will be able to strengthen their approach to cyber defense in today’s highly connected IoT world and empower cyber defenders to help their companies better anticipate and respond to cyber attacks.

Q8. What are the main lessons you have learned in implementing Big Data Analytic projects?

Narendra Mulani: Organizations should explore the entire big data technology ecosystem, take an outcome-focused approach to addressing specific business problems, and establish precise success metrics before an analytics project even begins. The big data landscape is in a constant state of change with new data sources and emerging big data technologies appearing every day that could offer a company a new value-generating opportunity. A hybrid technology infrastructure that combines existing analytics architecture with new big data technologies can help companies to achieve superior outcomes.
An outcome-focused strategy that embraces analytics experimentation, explores the data and technology that can help a company meet its goals, and has checkpoints for measuring performance will be very valuable: it helps the analytics team know whether to continue on course or make a correction to attain the desired outcome.

Q9. Is Data Analytics only good for businesses? What about using (Big) Data for Societal issues?

Narendra Mulani: Analytics is helping businesses across industries and governments as well to make more informed decisions for effective outcomes, whether it might be to improve customer experience, healthcare or public safety.
As an example, we’re working with a utility company in the UK to help them leverage analytics insights to anticipate equipment failures and respond in near real-time to critical situations, such as leaks or adverse weather events. We are also working with a government agency to analyze its video monitoring feeds to identify potential public safety risks.

Qx Anything else you wish to add?

Narendra Mulani: Another area that’s on the rise is Artificial Intelligence – we define it as a collection of multiple technologies that enable machines to sense, comprehend, act and learn, either on their own or to augment human activities. The new technologies include machine learning, deep learning, natural language processing, video analytics and more. AI is disrupting how businesses operate and compete and we believe it will also fundamentally transform and improve how we work and live. When an organization is pursuing an AI project, it’s our belief that it should be business-oriented, people-focused, and technology rich for it to be most effective.

———

As Chief Analytics Officer and Head Geek – Accenture Analytics, Narendra Mulani is responsible for creating a culture of analytics and driving Accenture’s strategic agenda for growth across the business. He leads a dedicated team of 17,000 Analytic professionals that serve clients around the globe, focusing on value creation through innovative solutions that combine industry and functional knowledge with analytics and technology.

Narendra has held a number of leadership roles within Accenture since joining in 1997. Most recently, he was the managing director – Products North America, where he was responsible for creating value for our clients across a number of industries. Prior to that, he was managing director – Supply Chain, Accenture Management Consulting, leading a global practice responsible for defining and implementing supply chain capabilities at a diverse set of Fortune 500 clients.

Narendra graduated from Bombay University in 1978 with a Bachelor of Commerce, and received an MBA in Finance in 1982 as well as a PhD in 1985 focused on Multivariate Statistics, both from the University of Massachusetts.

Outside of work, Narendra is involved with various activities that support education and the arts. He lives in Connecticut with his wife Nita and two children, Ravi and Nikhil.

———-

Resources

– Ducati is Analytics Driven. Analytics takes Ducati around the world with speed and precision.

Accenture Analytics. Launching an insights-driven transformation.  Download the point of view on analytics operating models to better understand how high performing companies are organizing their capabilities.

– Accenture Cyber Intelligence Platform. Analytics helping organizations to continuously predict, detect and combat cyber attacks.

–  Data Acceleration: Architecture for the Modern Data Supply Chain, Accenture

Related Posts

On Big Data and Data Science. Interview with James Kobielus. Source: ODBMS Industry Watch, 2016-04-19

On the Internet of Things. Interview with Colin Mahony. Source: ODBMS Industry Watch, 2016-03-14

A Grand Tour of Big Data. Interview with Alan Morrison. Source: ODBMS Industry Watch, 2016-02-25

On the Industrial Internet of Things. Interview with Leon Guzenda. Source: ODBMS Industry Watch, 2016-01-28

On Artificial Intelligence and Society. Interview with Oren Etzioni. Source: ODBMS Industry Watch, 2016-01-15

Follow us on Twitter: @odbmsorg

##

Recruit Institute of Technology. Interview with Alon Halevy
http://www.odbms.org/blog/2016/04/recruit-institute-of-technology-interview-with-alon-halevy/
Sat, 02 Apr 2016

” A revolution will happen when tools like Siri can truly serve as your personal assistant and you start relying on such an assistant throughout your day. To get there, these systems need more knowledge about your life and preferences, more knowledge about the world, better conversational interfaces and at least basic commonsense reasoning capabilities. We’re still quite far from achieving these goals.”–Alon Halevy

I have interviewed Alon Halevy, CEO at Recruit Institute of Technology.

RVZ

Q1. What is the mission of the Recruit Institute of Technology?

Alon Halevy: Before I describe the mission, I should introduce our parent company Recruit Holdings to those who may not be familiar with it. Recruit (founded in 1960) is a leading “life-style” information services and human resources company in Japan with services in the areas of recruitment, advertising, employment placement, staffing, education, housing and real estate, bridal, travel, dining, beauty, automobiles and others. The company is currently expanding worldwide and operates similar businesses in the U.S., Europe and Asia. In terms of size, Recruit has over 30,000 employees and its revenues are similar to those of Facebook at this point in time.

The mission of R.I.T is threefold. First, being the lab of Recruit Holdings, our goal is to develop technologies that improve the products and services of our subsidiary companies and create value for our customers from  the vast collections of data we have. Second, our mission is to advance scientific knowledge by contributing to the research community through publications in top-notch venues. Third, we strive to use technology for social good. This latter goal may be achieved through contributing to open-source software, working on digital artifacts that would be of general use to society, or even working with experts in a particular domain to contribute to a cause.

Q2. Isn’t it similar to the mission of the Allen Institute for Artificial Intelligence?

Alon Halevy: The Allen Institute is a non-profit whose admirable goal is to make fundamental contributions to Artificial Intelligence. While R.I.T strives to make fundamental contributions to A.I and related areas such as data management, we plan to work closely with our subsidiary companies and to impact the world through their products.

Q3. Driverless cars, digital Personal Assistants (e.g. Siri), Big Data, the Internet of Things, Robots: Are we on the brink of the next stage of the computer revolution?

Alon Halevy: I think we are seeing many applications in which AI and data (big or small) are starting to make a real difference and affecting people’s lives. We will see much more of it in the next few years as we refine our techniques. A revolution will happen when tools like Siri can truly serve as your personal assistant and you start relying on such an assistant throughout your day. To get there, these systems need more knowledge about your life and preferences, more knowledge about the world, better conversational interfaces and at least basic commonsense reasoning capabilities. We’re still quite far from achieving these goals.

Q4. You were for more than 10 years senior staff research scientist at Google, leading the Structured Data Group in Google Research. Was it difficult to leave Google?

Alon Halevy: It was extremely difficult leaving Google! I struggled with the decision for quite a while, and waving goodbye to my amazing team on my last day was emotionally heart wrenching. Google is an amazing company and I learned so much from my colleagues there. Fortunately, I’m very excited about my new colleagues and the entrepreneurial spirit of Recruit.
One of my goals at R.I.T is to build a lab with the same culture as that of Google and Google Research. So in a sense, I’m hoping to take Google with me. Some of my experiences from a decade at Google that are relevant to building a successful research lab are described in a blog post I contributed to the SIGMOD blog in September 2015.

Q5. What is your vision for the next three years for the Recruit Institute of Technology?

Alon Halevy: I want to build a vibrant lab with world-class researchers and engineers. I would like the lab to become a world leader in the broad area of making data usable, which includes data discovery, cleaning, integration, visualization and analysis.
In addition, I would like the lab to build collaborations with disciplines outside of Computer Science where computing techniques can make an even broader impact on society.

Q6. What are the most important research topics you intend to work on?

Alon Halevy: One of the roadblocks to applying AI and analysis techniques more widely within enterprises is data preparation.
Before you can analyze data or apply AI techniques to it, you need to be able to discover which datasets exist in the enterprise, understand the semantics of a dataset and its underlying assumptions, and to combine disparate datasets as needed. We plan to work on the full spectrum of these challenges with the goal of enabling many more people in the enterprise to explore their data.
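
For readers who like to see the idea in code, here is a minimal, hypothetical Python sketch (using pandas) of the kind of dataset discovery described above: profile each dataset, keep the profiles in a small catalog, and flag columns that might let disparate datasets be combined. The table names and the naive same-column-name join heuristic are illustrative assumptions, not R.I.T’s actual approach.

```python
# A hypothetical sketch of enterprise dataset discovery: profile each dataset,
# keep the profiles in a small catalog, and flag columns that could join two
# datasets. Purely illustrative, not R.I.T's tooling.
import pandas as pd

def profile(name, df):
    """Summarize a dataset: column names, inferred types, null rates, sample values."""
    return {
        "dataset": name,
        "columns": {
            col: {
                "dtype": str(df[col].dtype),
                "null_fraction": round(float(df[col].isna().mean()), 3),
                "sample_values": df[col].dropna().unique()[:3].tolist(),
            }
            for col in df.columns
        },
    }

def join_candidates(profile_a, profile_b):
    """Columns present in both datasets are crude candidates for combining them."""
    return sorted(set(profile_a["columns"]) & set(profile_b["columns"]))

# Two invented tables standing in for silo-ed enterprise sources.
jobs = pd.DataFrame({"job_id": [1, 2], "employer_id": [10, 11], "title": ["Chef", "Nurse"]})
employers = pd.DataFrame({"employer_id": [10, 11], "name": ["Cafe A", "Clinic B"], "city": ["Tokyo", "Osaka"]})

catalog = {p["dataset"]: p for p in (profile("jobs", jobs), profile("employers", employers))}
print(join_candidates(catalog["jobs"], catalog["employers"]))  # ['employer_id']
```

In practice the hard part is exactly what the answer points out: understanding the semantics and assumptions behind each column, which a simple profile like this can only hint at.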

Recruit being a lifestyle company, another fundamental question we plan to investigate is whether technology can help people make better life decisions. In particular, can technology help you take into consideration the many factors in your life as you make decisions, and steer you towards decisions that will make you happier over time? Clearly, we’ll need more than computer scientists to even ask the right questions here.

Q7. If we delegate decisions to machines, who will be responsible for the consequences? What are the ethical responsibilities of designers of intelligent systems?

Alon Halevy: You got an excellent answer from Oren Etzioni to this question in a recent interview. I agree with him fully and could not say it any better than he did.

Qx Anything you wish to add?

Alon Halevy: Yes. We’re hiring! If you’re a researcher or strong engineer who wants to make real impact on products and services in the fascinating area of lifestyle events and decision making, please consider R.I.T!

———-

Alon Halevy is the Executive Director of the Recruit Institute of Technology. From 2005 to 2015 he headed the Structured Data Management Research group at Google. Prior to that, he was a professor of Computer Science at the University of Washington in Seattle, where he founded the Database Group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space, and in 2004 he founded Transformic, a company that created search engines for the deep web, which was acquired by Google.
Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). Halevy is the author of the book “The Infinite Emotions of Coffee”, published in 2011, and serves on the board of the Alliance of Coffee Excellence.
He is also a co-author of the book “Principles of Data Integration”, published in 2012.
Dr. Halevy received his Ph.D. in Computer Science from Stanford University in 1993 and his Bachelor’s degree from the Hebrew University in Jerusalem.

Resources

– Civility in the Age of Artificial Intelligence, by Steve Lohr, technology reporter for The New York Times, ODBMS.org

– The threat from AI is real, but everyone has it wrong, by Robert Munro, CEO Idibon, ODBMS.org

Related Posts

– On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

– On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.

Follow us on Twitter: @odbmsorg

##

On Big Data and Society. Interview with Viktor Mayer-Schönberger http://www.odbms.org/blog/2016/01/on-big-data-and-society-interview-with-viktor-mayer-schonberger/ http://www.odbms.org/blog/2016/01/on-big-data-and-society-interview-with-viktor-mayer-schonberger/#comments Fri, 08 Jan 2016 09:06:10 +0000 http://www.odbms.org/blog/?p=4051

“There is potentially too much at stake to delegate the issue of control to individuals who are neither aware nor knowledgeable enough about how their data is being used to raise alarm bells and sue data processors.”–Viktor Mayer-Schönberger.

On Big Data and Society, I have interviewed Viktor Mayer-Schönberger, Professor of Internet Governance and Regulation at Oxford University (UK).

Happy New Year!

RVZ

Q1. Is big data changing people’s everyday world in a tangible way?

Viktor Mayer-Schönberger: Yes, of course. Most of us search online regularly. Internet search engines would not work nearly as well without Big Data (and those of us old enough to remember the Yahoo menus of the 1990s know how difficult it was then to find anything online). We would not have recommendation engines helping us find the right product (and thus reducing inefficient transaction costs), nor would flying in a commercial airplane be nearly as safe as it is today.

Q2. You mentioned in your recent book with Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live, Work and Think, that the fundamental shift is not in the machines that calculate data but in the data itself and how we use it. But what about people?

Viktor Mayer-Schönberger: I do not think data has agency (in contrast to Latour), so of course humans are driving the development. The point we were making is that the source of value isn’t the huge computing cluster or the smart statistical algorithm, but the data itself. So when asking, for instance, about the ethics of Big Data, it is wrong to focus on the ethics of algorithms; it is much more appropriate to focus on the ethics of data use.

Q3. What is more important people`s good intention or good data?

Viktor Mayer-Schönberger: This is a bit like asking whether one prefers apples or sunshine. Good data (being comprehensive and of high quality) reflects reality and thus can help us gain insights into how the world works. That does not make such discovery ethical, even though the discovery is correct. Good intentions point towards an ethical use of data, which helps protect us against unethical data uses, but does not prevent false big data analysis. This is a long way of saying we need both, albeit for different reasons.

Q4. What are your suggestions for concrete steps that can be taken to minimize and mitigate big data’s risks?

Viktor Mayer-Schönberger: I have been advocating ex ante risk assessments of big data uses, rather than (as at best we have today) ex post court action. There is potentially too much at stake to delegate the issue of control to individuals who are neither aware nor knowledgeable enough about how their data is being used to raise alarm bells and sue data processors. This is not something new. There are many areas of modern life that are so difficult and opaque for individuals to control that we have delegated control to competent government agencies.
For instance, we don’t test the food in supermarkets ourselves for safety, nor do we crash-test cars before we buy them (or test TV sets, washing machines or microwave ovens), or run our own drug trials.
In all of these cases we have put in place stringent regulation that has at its core a suitable process of risk assessment, and a competent agency to enforce it. This is what we need for Big Data as well.

Q5. Do you believe it is possible to ensure transparency, guarantee human free will, and strike a better balance between privacy and the use of personal information?

Viktor Mayer-Schönberger: Yes, I do believe that. Clearly, today we are not getting enough transparency, and there aren’t sufficiently effective guarantees for free will and privacy in place. So we can do better. And we must.

Q6. In your book you coined the terms “propensity” and “fetishization” of data. What do you mean by these terms?

Viktor Mayer-Schönberger: I don’t think we coined the term “propensity”. It’s an old term denoting the likelihood of something happening. With the “fetishization of data” we meant the temptation (in part caused by our human bias towards causality – understanding the world around us as a sequence of causes and effects) to imbue the results of Big Data analysis with more meaning than they deserve, especially suggesting that they tell us why when they only tell us what.

Q7. Can big and open data be effectively used for the common good?

Viktor Mayer-Schönberger: Of course. Big Data is at its core about understanding the world better than we do today. I would not be in the academy if I did not believe strongly that knowledge is essential for human progress.

Q8. Assuming there is real potential in using data-driven methods to both help charities develop better services and products, and understand civil society activity: what are the key lessons and recommendations for future work in this space?

Viktor Mayer-Schönberger: My sense is that we need to hope for two developments. First, that more researchers team up with decision makers in charities, and more broadly civil society organizations (and the government) to utilize Big Data to improve our understanding of the key challenges that our society is facing. We need to improve our understanding. Second, we also need decision makers and especially policy makers to better understand the power of Big Data – they need to realize that for their decision making data is their friend; and they need to know that especially here in Europe, the cradle of enlightenment and modern science, data-based rationality is the antidote to dangerous beliefs and ideologies.

Q9. What are your current areas of research?

Viktor Mayer-Schönberger: I have been working on how Big Data is changing learning and the educational system, as well as how Big Data changes the process of discovery, and how this has huge implications, for instance in the medical field.

——————
Viktor Mayer-Schönberger is Professor of Internet Governance and Regulation at Oxford University. In addition to the best-selling “Big Data” (with Kenneth Cukier), Mayer-Schönberger has published eight books, including the award-winning “Delete: The Virtue of Forgetting in the Digital Age”, and is the author of over a hundred articles and book chapters on the information economy. He is a frequent public speaker, and his work has been featured in (among others) the New York Times, Wall Street Journal, Financial Times, The Economist, Nature and Science.

Books
Mayer-Schönberger, V. and Cukier, K. (2013) Big Data: A Revolution That Will Transform How We Live, Work and Think. John Murray.

Mayer-Schönberger, V. (2009) Delete – The Virtue of Forgetting in the Digital Age. Princeton University Press.

Related Posts

Have we closed the “digital divide”, or is it just getting wider? by Andrea Powell, CIO, CABI. ODBMS.org, January 1, 2016

How can Open Data help to solve long-standing problems in agriculture and nutrition? by Andrea Powell, CIO, CABI. ODBMS.org, December 7, 2015

Big Data and Large Numbers of People: the Need for Group Privacy by Prof. Luciano Floridi, Oxford Internet Institute, University of Oxford. ODBMS.org, March 2, 2015

——————
Follow ODBMS.org on Twitter: @odbmsorg.

##

On Data Curation. Interview with Andy Palmer http://www.odbms.org/blog/2015/01/interview-andy-palmer-tamr/ http://www.odbms.org/blog/2015/01/interview-andy-palmer-tamr/#comments Wed, 14 Jan 2015 09:07:47 +0000 http://www.odbms.org/blog/?p=3644

“We propose more data transparency not less.”–Andy Palmer

I have interviewed Andy Palmer, a serial entrepreneur who co-founded Tamr with database scientist and MIT professor Michael Stonebraker.

Happy and Peaceful 2015!

RVZ

Q1. What is the business proposition of Tamr?

Andy Palmer: Tamr provides a data unification platform that reduces by as much as 90% the time and effort of connecting and enriching multiple data sources to achieve a unified view of silo-ed enterprise data. Using Tamr, organizations are able to complete data unification projects in days or weeks versus months or quarters, dramatically accelerating time to analytics.
This capability is particularly valuable to businesses: they can get a 360-degree view of the customer, unify their supply chain data (e.g. parts catalogs and supplier lists) to reduce costs and risk, and speed up conversion of clinical trial data for submission to the FDA.

Q2. What are the main technological and business challenges in producing a single, unified view across various enterprise ERPs, Databases, Data Warehouses, back-office systems, and most recently sensor and social media data in the enterprise?

Andy Palmer: Technological challenges include:
– Silo-ed data, stored in varying formats and standards
– Disparate systems, instrumented but expensive to consolidate and difficult to synchronize
– Inability to use knowledge from data owners/experts in a programmatic way
– Top-down, rules-based approaches not able to handle the extreme variety of data typically found, for example, in large PLM and ERP systems.

Business challenges include:
– Globalization, where similar or duplicate data may exist in different places in multiple divisions
– M&As, which can increase the volume, variety and duplication of enterprise data sources overnight
– No complete view of enterprise data assets
– “Analysis paralysis,” the inability of business people to access the data they want/need because IT people are in the critical path of preparing it for analysis

Tamr can connect and enrich data from internal and external sources, from structured data in relational databases, data warehouses, back-office systems and ERP/PLM systems to semi- or unstructured data from sensors and social media networks.

Q3. How do you manage to integrate various part and supplier data sources to produce a unified view of vendors across the enterprise?

Andy Palmer: Patent-pending technology using machine learning algorithms performs most of the work, unifying up to 90% of supplier, part and site entities by:

– Referencing each transaction and record across many data sources

– Building correct supplier names, addresses, ID’s, etc. for a variety of analytics

– Cataloging into an organized inventory of sources, entities, and attributes

When human intervention is necessary, Tamr generates questions for data experts, aggregates responses, and feeds them back into the system. This feedback enables Tamr to continuously improve its accuracy and speed.
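
To make the idea concrete, here is a toy Python sketch of collapsing supplier records from several source systems into one unified record per supplier. The normalization rules, field names and sample rows are assumptions for illustration only; they are not Tamr’s patent-pending algorithms.

```python
# A toy sketch of the entity-unification step described above: collapse supplier
# records from several source systems into one record per supplier.
from collections import defaultdict

def normalize(name):
    """Crude canonical key: lowercase, strip punctuation and common legal suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    tokens = [t for t in cleaned.split() if t not in {"inc", "corp", "ltd", "co"}]
    return " ".join(tokens)

def unify(records):
    """Group records sharing a canonical key and merge their attributes."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["supplier"])].append(rec)
    return [
        {
            "canonical_name": key,
            "source_systems": sorted({r["source"] for r in recs}),
            "addresses": sorted({r["address"] for r in recs}),
        }
        for key, recs in groups.items()
    ]

records = [  # hypothetical rows pulled from two ERP systems
    {"source": "ERP-EU", "supplier": "Acme Corp.", "address": "1 Main St"},
    {"source": "ERP-US", "supplier": "ACME corp", "address": "1 Main Street"},
    {"source": "ERP-US", "supplier": "Globex Inc", "address": "9 Side Rd"},
]
for row in unify(records):
    print(row)  # the two "Acme" variants collapse onto one unified supplier
```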

Q4. Who should be using Tamr?

Andy Palmer: Organizations whose business and profitability depend on being able to do analysis on a unified set of data, and ask questions of that data, should be using Tamr.

Examples include:
– a manufacturer that wants to optimize spend across supply chains, but lacks a unified view of parts and suppliers.

– a biopharmaceutical company that needs to achieve a unified view of diverse clinical trials data to convert it to mandated CDISC standards for ongoing submissions to the FDA – but lacks an automated and repeatable way to do this.

– a financial services company that wants to achieve a unified view of its customers – but lacks an efficient, repeatable way to unify customer data across multiple systems, applications, and its consumer banking, loans, wealth management and credit card businesses.

– the research arm of a pharmaceutical company that wants to unify data on bioassay experiments across 8,000 research scientists, to achieve economies, avoid duplication of effort and enable better collaboration.

Q5. “Data transparency” is not always welcome in the enterprise, mainly due to non-technical reasons. What do you suggest to do in order to encourage people in the enterprise to share their data?

Andy Palmer: We propose more data transparency not less.
This is because in most companies, people don’t even know what data sources are available to them, let alone have insight into them or use of them. With Tamr, companies can create a catalog of all their enterprise data sources; they can then choose how transparent to make those individual data sources by showing metadata about each. Then, they can control usage of the data sources using the enterprise’s access management and security policies/systems.
On the business side, we have found that people in enterprises typically want an easier way to share the data sources they have built or nurtured ─ a way that gets them out of the critical path.
Tamr makes people’s data usable by many others and for many purposes, while eliminating the busywork involved.

Q6. What is Data Curation and why is it important for Big Data?

Andy Palmer: Data Curation is the process of creating a unified view of your data with the standards of quality, completeness, and focus that you define. A typical curation process consists of the following steps (a brief code sketch after this list illustrates the cleaning, transforming and deduplicating steps):

– Identifying data sets of interest (whether from inside the enterprise or outside),

– Exploring the data (to form an initial understanding),

– Cleaning the incoming data (for example, 99999 is not a valid ZIP code),

– Transforming the data (for example, to remove phone number formatting),

– Unifying it with other data of interest (into a composite whole), and

– Deduplicating the resulting composite.
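
The sketch below (Python with pandas) walks through the cleaning, transforming and deduplicating steps on a tiny made-up table. The specific rules — rejecting the 99999 placeholder ZIP, stripping phone formatting, deduplicating on name plus phone — are illustrative assumptions, not Tamr’s implementation.

```python
# A minimal, illustrative pass over the curation steps listed above.
import re
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Ann Lee", "Ann Lee", "Bo Chan"],
    "zip":      ["02139",   "99999",   "10001"],          # 99999 used as a junk placeholder
    "phone":    ["(617) 555-0100", "617-555-0100", "212.555.0199"],
})

def clean(df):
    df = df.copy()
    df.loc[~df["zip"].str.match(r"^\d{5}$"), "zip"] = None  # drop malformed ZIPs
    df.loc[df["zip"] == "99999", "zip"] = None               # drop placeholder ZIPs
    return df

def transform(df):
    df = df.copy()
    df["phone"] = df["phone"].map(lambda p: re.sub(r"\D", "", p))  # keep digits only
    return df

def deduplicate(df):
    # After cleaning and transforming, records describing the same person
    # collapse onto the same (customer, phone) key.
    return df.drop_duplicates(subset=["customer", "phone"], keep="first")

curated = deduplicate(transform(clean(raw)))
print(curated)  # the two "Ann Lee" rows are now a single record
```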

Data Curation is important for Big Data because people want to mix and match from all the data available to them ─ external and internal ─ for analytics and downstream applications that give them competitive advantage. Tamr is important because traditional, rule-based approaches to data curation are not sufficient to solve the problem of broad integration.

Q7. What does it mean to do “fuzzy” matches between different data sources?

Andy Palmer: Tamr can make educated guesses that two similar fields refer to the same entity even though the fields describe it differently: for example, Tamr can tell that “IBM” and “International Business Machines” refer to the same company.
In Supply Chain data unification, fuzzy matching is extremely helpful in speeding up entity and attribute resolution between parts, suppliers and customers.
Tamr’s secret sauce: Connecting hundreds or thousands of sources through a bottom-up, probabilistic solution reminiscent of Google’s approach to web search and connection.
Tamr’s upside: it becomes the Google of Enterprise Data, using probabilistic data source connection and curation to revolutionize enterprise data analysis.
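
As a rough illustration of such fuzzy matching, the following Python sketch scores name pairs with a character-similarity ratio plus an acronym check, which is enough to link “IBM” with “International Business Machines”. Tamr’s probabilistic matcher is of course far more sophisticated; this is only a minimal stand-in.

```python
# A small, assumption-laden sketch of "fuzzy" name comparison.
from difflib import SequenceMatcher

def acronym(name):
    """First letters of each word, e.g. 'International Business Machines' -> 'IBM'."""
    return "".join(word[0] for word in name.split()).upper()

def name_similarity(a, b):
    """Score in [0, 1]: the better of character similarity and an acronym match."""
    char_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    acro_match = 1.0 if a.upper() == acronym(b) or b.upper() == acronym(a) else 0.0
    return max(char_sim, acro_match)

print(name_similarity("IBM", "International Business Machines"))  # 1.0 via the acronym check
print(name_similarity("Acme Corp.", "ACME Corporation"))          # ~0.7 via character similarity
print(name_similarity("IBM", "Idibon"))                           # low (~0.4), likely not a match
```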

Q8. What is data unification and how effective is it to use Machine Learning for this?

Andy Palmer: Data Unification is part of the curation process, during which related data sources are connected to provide a unified view of a given entity and its associated attributes. Tamr’s application of machine learning is very effective: it can get you 90% of the way to data unification in many cases, then involve human experts strategically to guide unification the rest of the way.

Q9. How do you leverage the knowledge of existing business experts for guiding/ modifying the machine learning process?

Andy Palmer: Patent-pending technology using machine learning algorithms performs most of the data integration work. When human intervention is necessary, Tamr generates simple yes-no questions for data experts, aggregates their responses, and feeds them back into the system. This feedback enables Tamr to continuously improve its accuracy and speed.
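
Here is a schematic sketch of such an expert-in-the-loop cycle: confident matches are accepted automatically, borderline ones are turned into yes/no questions, and the answers are collected as labeled examples for future training. The thresholds and the ask_expert stub are invented for the example; a score function like the name_similarity sketch above could be plugged in. This is not Tamr’s actual workflow, only an illustration of the pattern.

```python
# An illustrative expert-in-the-loop cycle: accept or reject confidently scored
# pairs automatically, and ask a human only about the borderline cases.
AUTO_ACCEPT, AUTO_REJECT = 0.9, 0.2   # assumed thresholds

def ask_expert(pair):
    """Stand-in for sending a simple yes/no question to a data expert."""
    answer = input(f"Do '{pair[0]}' and '{pair[1]}' refer to the same entity? [y/n] ")
    return answer.strip().lower().startswith("y")

def triage(candidate_pairs, score, labeled_examples):
    matches = []
    for pair in candidate_pairs:
        s = score(*pair)
        if s >= AUTO_ACCEPT:
            matches.append(pair)                       # machine is confident: accept
        elif s <= AUTO_REJECT:
            continue                                   # machine is confident: reject
        else:
            is_match = ask_expert(pair)                # borderline: ask a human
            labeled_examples.append((pair, is_match))  # feedback for the next model
            if is_match:
                matches.append(pair)
    return matches

# labeled_examples would periodically be used to retrain or recalibrate `score`,
# so fewer questions need to be routed to experts over time.
```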

Q10. With Tamr you claim that less human involvement is required as the systems “learns.” What are in your opinion the challenges and possible dangers of such an “automated” decision making process if not properly used or understood? Isn’t there a danger of replacing the experts with intelligent machines?

Andy Palmer: We aren’t replacing human experts at all: we are bringing them into the decision-making process in a high-value, programmatic way. And there are data stewards and provenance and governance procedures in place that control how this is done. For example: in one of our pharma customers, we’re actually bringing the research scientists who created the data into the decision-making process, capturing their wisdom in Tamr. Before, they were never asked: some guy in IT was trying to guess what each scientist meant when he created his data. Or the scientists were asked via email, which, due to the nature of the biopharmaceutical industry, required printing out the emails for audit purposes.

Q11. How do you quantify the cost savings using Tamr?

Andy Palmer: The biggest savings aren’t from the savings in data curation (although these are significant), but the opportunities for savings uncovered through analysis of unified data ─ opportunities that wouldn’t otherwise have been discovered. For example, by being able to create and update a ‘golden record’ of suppliers across different countries and business groups, Tamr can provide a more comprehensive view of supplier spend.
You can use this view to identify long-tail opportunities for savings across many smaller suppliers, instead of the few large vendors visible to you without Tamr.
In the aggregate, these long-tail opportunities can easily account for 85% of total spend savings.

Q12. Could you give us some examples of use cases where Tamr is making a significant difference?

Andy Palmer: Supply Chain Management, for streamlining spend analytics and spend management. Unified views of supplier and parts data enable optimization of supplier payment terms and identification of “long-tail” savings opportunities in small or outlier suppliers that were not easily identifiable before.

Clinical Trials Management, for automated conversion of multi-source/multi-standard CDISC data (typically stored in SAS databases) to meet submission standards mandated by regulators.
Tamr eliminates manual methods, which are usually conducted by expensive outside consultants and can result in additional, inflexible data stored in proprietary formats; and provides a scalable, repeatable process for data conversion (IND/NDA programs necessitate frequent resubmission of data).

Sales and Marketing, for achieving a unified view of the customer.
Tamr enables the business to connect and unify customer data across multiple applications, systems and business units, to improve segmentation/targeting and ultimately sell more products and services.

——————–

Andy Palmer, Co-Founder and CEO, Tamr Inc.

Andy Palmer is co-founder and CEO of Tamr, Inc. Palmer co-founded Tamr with fellow entrepreneur Michael Stonebraker, PhD. Previously, Palmer was co-founder and founding CEO of Vertica Systems, a pioneering big data analytics company (acquired by HP). During his career as an entrepreneur, Palmer has served as founder, founding investor, BOD member or advisor to more than 50 start-up companies. He also served as Global Head of Software Engineering and Architecture at Novartis Institutes for BioMedical Research (NIBR) and as a member of the start-up team and Senior Vice President of Operations and CIO at Infinity Pharmaceuticals (NASDAQ: INFI). He earned undergraduate degrees in English, history and computer science from Bowdoin College, and an MBA from the Tuck School of Business at Dartmouth.
————————–
-Resources

Data Science is mainly a Human Science. ODBMS.org, October 7, 2014

Big Data Can Drive Big Opportunities, by Mike Cavaretta, Data Scientist and Manager at Ford Motor Company. ODBMS.org, October 2014.

Big Data: A Data-Driven Society? by Roberto V. Zicari, Goethe University, Stanford EE Computer Systems Colloquium, October 29, 2014

-Related Posts

On Big Data Analytics. Interview with Anthony Bak. ODBMS Industry Watch, December 7, 2014

Predictive Analytics in Healthcare. Interview with Steve Nathan. ODBMS Industry Watch, August 26, 2014

-Webinar
January 27th at 1PM
Webinar: Toward Automated, Scalable CDISC Conversion
John Keilty, Third Rock Ventures | Timothy Danford, Tamr, Inc.

During a one-hour webinar, join John Keilty, former VP of Informatics at Infinity Pharmaceuticals, and Timothy Danford, CDISC Solution Lead for Tamr, as they discuss some of the key challenges in preparing clinical trial data for submission to the FDA, and the problems associated with current preparation processes.

Follow ODBMS.org on Twitter: @odbmsorg
