ODBMS Industry Watch » Google
http://www.odbms.org/blog
Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.

On Open Source Databases. Interview with Peter Zaitsev
http://www.odbms.org/blog/2017/09/on-open-source-databases-interview-with-peter-zaitsev/
Wed, 06 Sep 2017 00:49:18 +0000

“To be competitive with non-open-source cloud deployment options, open source databases need to invest in “ease-of-use.” There is no tolerance for complexity in many development teams as we move to “ops-less” deployment models.” –Peter Zaitsev

I have interviewed Peter Zaitsev, Co-Founder and CEO of Percona.
In this interview, Peter talks about the Open Source Databases market; the Cloud; the scalability challenges at Facebook; compares MySQL, MariaDB, and MongoDB; and presents Percona’s contribution to the MySQL and MongoDB ecosystems.

RVZ

Q1. What are the main technical challenges in obtaining application scaling?

Peter Zaitsev: When it comes to scaling, there are different types. There is a Facebook/Google/Alibaba/Amazon scale: these giants are pushing boundaries, and usually are solving very complicated engineering problems at a scale where solutions aren’t easy or known. This often means finding edge cases that break things like hardware, operating system kernels and the database. As such, these companies not only need to build very large-scale infrastructure with a high level of automation, but also ensure it is robust enough to handle these kinds of issues with limited user impact. A great deal of hardware and software deployment practices must be in place for such installations.

While these “extreme-scale” applications are very interesting and get a lot of publicity at tech events and in tech publications, this is a very small portion of all the scenarios out there. The vast majority of applications are running at the medium to high scale, where implementing best practices gets you the scalability you need.

When it comes to MySQL, perhaps the most important question is when you need to “shard.” Sharding — while used by every application at extreme scale — isn’t a simple “out-of-the-box” feature in MySQL. It often requires a lot of engineering effort to correctly implement it.
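To give a sense of what that engineering effort involves, here is a minimal sketch of application-level shard routing in Python. It is an illustration only, not something described in the interview: the shard names, the choice of a user ID as the sharding key, and both function names are hypothetical.

```python
import hashlib

# Hypothetical shard list: these names are placeholders, not from the interview.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]


def shard_for(user_id, shards=SHARDS):
    """Pick a shard by hashing the key, so routing is stable across processes."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def group_by_shard(user_ids, shards=SHARDS):
    """Group keys by shard so a multi-user lookup becomes one query per shard."""
    plan = {}
    for uid in user_ids:
        plan.setdefault(shard_for(uid, shards), []).append(uid)
    return plan
```

Even this toy version hints at the cost: with modulo-based routing, changing the number of shards remaps most keys, and any query that spans users has to be split per shard and merged in the application.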

While sharding is sometimes required, you should really examine whether it is necessary for your application. A single MySQL instance can easily handle hundreds of thousands (or more) of moderately complicated queries per second, and terabytes of data. Pair that with Memcached or Redis caching, MySQL Replication or more advanced solutions such as Percona XtraDB Cluster or Amazon Aurora, and you can cover the transactional (operational) database needs for applications of a very significant scale.

Besides making such high-level architecture choices, you of course need to also ensure that you exercise basic database hygiene. Ensure that you’re using the correct hardware (or cloud instance type), the right MySQL and operating system version and configuration, have a well-designed schema and good indexes. You also want to ensure good capacity planning, so that when you want to take your system to the next scale and begin to thoroughly look at it you’re not caught by surprise.

Q2. Why did Facebook create MyRocks, a new flash-optimized transactional storage engine on top of the RocksDB storage engine for MySQL?

Peter Zaitsev: The Facebook Team is the most qualified to answer this question. However, I imagine that at Facebook scale being efficient is very important because it helps to drive the costs down. If your hot data is already in the cache, what matters is that your database handles writes efficiently, so you want a “write-optimized engine.”
If you use Flash storage, you also care about two things:

– A high level of compression since Flash storage is much more expensive than spinning disk.

– You are also interested in writing as little to the storage as possible, as the more you write the faster it wears out (and needs to be replaced).

RocksDB and MyRocks are able to achieve all of these goals. As an LSM-based storage engine, writes (especially Inserts) are very fast — even for giant data sizes. They’re also much better suited for achieving high levels of compression than InnoDB.

This Blog Post by Mark Callaghan has many interesting details, including this table which shows MyRocks having better performance, write amplification and compression for Facebook’s workload than InnoDB.

Q3. Beringei is Facebook’s open source, in-memory time series database. According to Facebook, large-scale monitoring systems cannot handle large-scale analysis in real time because the query performance is too slow. What is your take on this?

Peter Zaitsev: Facebook operates at extreme scale, so it is no surprise the conventional systems don’t scale well enough or aren’t efficient enough for Facebook’s needs.

I’m very excited Facebook has released Beringei as open source. Beringei itself is a relatively low-end storage engine that is hard to use for a majority of users, but I hope it gets integrated with other open source projects and provides a full-blown high-performance monitoring solution. Integrating it with Prometheus would be a great fit for solutions with extreme data ingestion rates and very high metric cardinality.

Q4. How do you see the market for open source databases evolving?

Peter Zaitsev: The last decade has seen a lot of open source database engines built, offering a lot of different data models, persistence options, high availability options, etc. Some of them were built as open source from scratch, while others were released as open source after years of being proprietary engines, with the most recent example being Comdb2 by Bloomberg. I think this heavy competition is great for pushing innovation forward, and is very exciting! For example, I think that if MongoDB hadn’t shown how many developers love a document-oriented data model, we might never have seen MySQL Document Store in the MySQL ecosystem.

With all this variety, I think there will be a lot of consolidation and only a small fraction of these new technologies really getting wide adoption. Many will either have niche deployments, or will be an idea breeding ground that gets incorporated into more popular database technologies.

I do not think SQL will “die” anytime soon, even though it is many decades old. But I also don’t think we will see it being the dominant “database” language, as it has been since the turn of the millennium.

The interesting disruptive force for open source technologies is the cloud. It will be very interesting for me to see how things evolve. With pay-for-use models of the cloud, the “free” (as in beer) part of open source does not apply in the same way. This reduces incentives to move to open source databases.

To be competitive with non-open-source cloud deployment options, open source databases need to invest in “ease-of-use.” There is no tolerance for complexity in many development teams as we move to “ops-less” deployment models.

Q5. In your opinion what are the pros and cons of MySQL vs. MariaDB?

Peter Zaitsev: While tracing its roots to MySQL, MariaDB is quickly becoming a very different database.
It implements some features MySQL doesn’t, but also leaves out others (MySQL Document Store and Group Replication) or implements them in a different way (JSON support and Replication GTIDs).

From the MySQL side, we have Oracle’s financial backing and engineering. You might dislike Oracle, but I think you agree they know a thing or two about database engineering. MySQL is also far more popular, and as such more battle-tested than MariaDB.

MySQL is developed by a single company (Oracle) and does not have as many external contributors compared to MariaDB — which has its own pluses and minuses.

MySQL is “open core,” meaning some components are available only in the proprietary version, such as Enterprise Authentication, Enterprise Scalability, and others. Alternatives for a number of these features are available in Percona Server for MySQL though (which is completely open source). MariaDB Server itself is completely open source, though you might need other components that aren’t open source to build a full solution, namely MaxScale.

Another thing MariaDB has going for it is that it is included in a number of Linux distributions. Many new users will be getting their first “MySQL” experience with MariaDB.

For additional insight into MariaDB, MySQL and Percona Server for MySQL, you can check out this recent article

Q6. What’s new in the MySQL and MongoDB ecosystem?

Peter Zaitsev: This could be its own and rather large article! With MySQL, we’re very excited to see what is coming in MySQL 8. There should be a lot of great changes in pretty much every area, ranging from the optimizer to retiring a lot of architectural debt (some of it 20 years old). MySQL Group Replication and MySQL InnoDB Cluster, while still early in their maturity, are very interesting products.

For MongoDB we’re very excited about MongoDB 3.4, which has been taking steps to be a more enterprise ready database with features like collation support and high-performance sharding. A number of these features are only available in the Enterprise version of MongoDB, such as external authentication, auditing and log redaction. This is where Percona Server for MongoDB 3.4 comes in handy, by providing open source alternatives for the most valuable Enterprise-only features.

For both MySQL and MongoDB, we’re very excited about RocksDB-based storage engines. MyRocks and MongoRocks both offer outstanding performance and efficiency for certain workloads.

Q7. Anything else you wish to add?

Peter Zaitsev: I would like to use this opportunity to highlight Percona’s contribution to the MySQL and MongoDB ecosystems by mentioning two of our open source products that I’m very excited about.

First, Percona XtraDB Cluster 5.7.
While this has been around for about a year, we just completed a major performance improvement effort that allowed us to increase performance up to 10x. I’m not talking about improving some very exotic workloads: these performance improvements are achieved in very typical high-concurrency environments!

I’m also very excited about our Percona Monitoring and Management product, which is unique in being the only fully packaged open source monitoring solution specifically built for MySQL and MongoDB. It is a newer product that has been available for less than a year, but we’re seeing great momentum in adoption in the community. We are focusing many of our resources on improving it and making it more effective.

———————


Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With more than 150 professionals in 29 countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of Internet giants, large enterprises and many exciting startups. Percona was named to the Inc. 5000 in 2013, 2014, 2015 and 2016.

Peter was an early employee at MySQL AB, eventually leading the company’s High Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Data Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his recent ebook Practical MySQL Performance Optimization Volume 1 is one of percona.com’s most popular downloads.
————————-

Resources

Percona, in collaboration with Facebook, announced the first experimental release of MyRocks in Percona Server for MySQL 5.7, with packages. September 6, 2017

eBook, “Practical MySQL Performance Optimization,” by Percona CEO Peter Zaitsev and Principal Consultant Alexander Rubin. (LINK to DOWNLOAD, registration required)

MySQL vs MongoDB – When to Use Which Technology. Peter Zaitsev, June 22, 2017

Percona Live Open Source Database Conference Europe, Dublin, Ireland. September 25 – 27, 2017

Percona Monitoring and Management (PMM) Graphs Explained: MongoDB with RocksDB. By Tim Vaillancourt, June 18, 2017

Related Posts

On Apache Ignite, Apache Spark and MySQL. Interview with Nikita Ivanov. ODBMS Industry Watch, 2017-06-30

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah. ODBMS Industry Watch, 2017-03-13

On in-memory, key-value data stores. Ofer Bengal and Yiftach Shoolman. ODBMS Industry Watch, 2017-02-13

Follow us on Twitter: @odbmsorg

Internet of Things: Safety, Security and Privacy. Interview with Vint G. Cerf
http://www.odbms.org/blog/2017/06/internet-of-things-safety-security-and-privacy-interview-with-vint-g-cerf/
Sun, 11 Jun 2017 17:06:03 +0000

“I like the idea behind programmable, communicating devices and I believe there is great potential for useful applications. At the same time, I am extremely concerned about the safety, security and privacy of such devices.” –Vint G. Cerf

I had the pleasure of interviewing Vinton G. Cerf. Widely known as one of the “Fathers of the Internet,” Cerf is the co-designer of the TCP/IP protocols and the architecture of the Internet. The main topic of the interview is the Internet of Things (IoT) and its challenges, especially the safety, security and privacy of IoT devices.
Vint is currently Chief Internet Evangelist for Google.
RVZ

Q1. Do you like the Internet of Things (IoT)?

Vint Cerf: This question is far too general to answer. I like the idea behind programmable, communicating devices and I believe there is great potential for useful applications. At the same time, I am extremely concerned about the safety, security and privacy of such devices. Penetration and re-purposing of these devices can lead to denial of service attacks (botnets), invasion of privacy, harmful dysfunction, serious security breaches and many other hazards. Consequently the makers and users of such devices have a great deal to be concerned about.

Q2. Who is going to benefit most from the IoT?

Vint Cerf: The makers of the devices will benefit if they become broadly popular and perhaps even mandated to become part of a local ecosystem. Think “smart cities” for example. The users of the devices may benefit from their functionality, from the information they provide that can be analyzed and used for decision-making purposes, for example. But see Q1 for concerns.

Q3. One of the most important requirements for collections of IoT devices is that they guarantee physical safety and personal security. What are the challenges from a safety and privacy perspective that the pervasive introduction of sensors and devices poses? (e.g. at home, in cars, hospitals, wearables and ingestibles, etc.)

Vint Cerf: Access control and strong authentication of parties authorized to access device information or control planes will be a primary requirement. The devices must be configurable to resist unauthorized access and use. Putting physical limits on the behavior of programmable devices may be needed or at least advisable (e.g., cannot force the device to operate outside of physically limited parameters).

Q5. Consumers want privacy. With IoT physical objects in our everyday lives will increasingly detect and share observations about us. How is it possible to reconcile these two aspects?

Vint Cerf: This is going to be a tough challenge. Videocams that help manage traffic flow may also be used to monitor individuals or vehicles without their permission or knowledge, for example (cf: UK these days). In residential applications, one might want (insist on) the ability to disable the devices manually, for example. One would also want assurances that such disabling cannot be defeated remotely through the software.

Q6. Let’s talk more about security. It is reported that badly configured “smart devices” might provide a backdoor for hackers. What is your take on this?

Vint Cerf: It depends on how the devices are connected to the rest of the world. A particularly bad scenario would have a hacker taking over the operating system of 100,000 refrigerators. The refrigerator programming could be preserved but the hacker could add any of a variety of other functionality including DDOS capacity, virus/worm/Trojan horse propagation and so on.
One might want the ability to monitor and log the sources and sinks of traffic to/from such devices to expose hacked devices under remote control, for example. This is all a very real concern.

Q7. What measures can be taken to ensure a more “secure” IoT?

Vint Cerf: Hardware to inhibit some kinds of hacking (e.g. through buffer overflows) can help. Digital signatures on bootstrap programs checked by hardware to inhibit boot-time attacks. Validation of software updates as to integrity and origin. Whitelisting of IP addresses and identifiers of end points that are allowed direct interaction with the device.
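Two of the measures above, whitelisting of end points and validation of software updates, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended design: the allowed networks are documentation-range placeholders, the function names are hypothetical, and an HMAC with a shared key stands in for the public-key digital signature that a real device would verify.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical allowlist: documentation-range networks used as placeholders.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_allowed(peer):
    """Return True only if the peer address falls inside an allowed network."""
    try:
        addr = ipaddress.ip_address(peer)
    except ValueError:
        return False  # anything that is not a valid IP address is rejected
    return any(addr in net for net in ALLOWED_NETWORKS)


def update_is_authentic(image, tag, key):
    """Check an update image against its authentication tag.

    Assumption: HMAC-SHA256 with a shared key stands in for a real
    public-key signature check on integrity and origin.
    """
    expected = hmac.new(key, image, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)  # constant-time comparison
```

A device following this sketch would drop connections from peers that fail `is_allowed` and refuse to install any image for which `update_is_authentic` returns False.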

Q8. Is there a danger that IoT evolves into a possible enabling platform for cyber-criminals and/or for cyber war offenders?

Vint Cerf: There is no question this is already a problem. The DYN Corporation DDOS attack was launched by a botnet of webcams that were readily compromised because they had either no access controls or only well-known usernames and passwords. This is the reason that companies must feel great responsibility and be provided with strong incentives to limit the potential for abuse of their products.

Q9. What are your personal recommendations for a research agenda and policy agenda based on advances in the Internet of Things?

Vint Cerf: Better hardware reinforcement of access control and use of the IOT computational assets. Better quality software development environments to expose vulnerabilities before they are released into the wild. Better software update regimes that reduce barriers to and facilitate regular bug fixing.

Q10. The IoT is still very much a work in progress. How do you see the IoT evolving in the near future?

Vint Cerf: Chaotic “standardization” with many incompatible products on the market. Many abuses by hackers. Many stories of bugs being exploited or serious damaging consequences of malfunctions. Many cases of “one device, one app” that will become unwieldy over time. Dramatic and positive cases of medical monitoring that prevents serious medical harms or signals imminent dangers. Many experiments with smart cities and widespread sensor systems.
Many applications of machine learning and artificial intelligence associated with IOT devices and the data they generate. Slow progress on common standards.

—————
Vinton G. Cerf co-designed the TCP/IP protocols and the architecture of the Internet and is Chief Internet Evangelist for Google. He is a member of the National Science Board and National Academy of Engineering and Foreign Member of the British Royal Society and Swedish Royal Academy of Engineering, and Fellow of ACM, IEEE, AAAS, and BCS.
Cerf received the US Presidential Medal of Freedom, US National Medal of Technology, Queen Elizabeth Prize for Engineering, Prince of Asturias Award, Japan Prize, ACM Turing Award, Legion d’Honneur and 29 honorary degrees.

Resources

European Commission, Internet of Things Privacy & Security Workshop’s Report, 10/04/2017

Securing the Internet of Things. US Homeland Security, November 16, 2016

Related Posts

Social and Ethical Behavior in the Internet of Things By Francine Berman, Vinton G. Cerf. Communications of the ACM, Vol. 60 No. 2, Pages 6-7, February 2017

Security in the Internet of Things, McKinsey & Company, May 2017

Interview to Vinton G. Cerf. ODBMS Industry Watch, July 27, 2009

Five Challenges to IoT Analytics Success. By Dr. Srinath Perera. ODBMS.org, September 23, 2016

Identity Graph Analysis at Scale. Interview with Niels Meersschaert
http://www.odbms.org/blog/2017/05/interview-with-niels-meersschaert/
Tue, 09 May 2017 07:10:19 +0000

“I’ve found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that.”–Niels Meersschaert.

I have interviewed Niels Meersschaert, Chief Technology Officer at Qualia. The Qualia team relies on over one terabyte of graph data in Neo4j, combined with larger amounts of non-graph data to provide major companies with consumer insights for targeted marketing and advertising opportunities.

RVZ

Q1. Your background is in Television & Film Production. How does it relate to your current job?

Niels Meersschaert: Engineering is a lot like producing. You have to understand what you are trying to achieve, understand what parts and roles you’ll need to accomplish it, all while doing it within a budget. I’ve found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that. Engineering is both art and science and creativity is a critical skill for the best engineers. I also believe that a breadth of languages is critical for engineers.

Q2. Your company collects data on more than 90% of American households. What kind of data do you collect and how do you use such data?

Niels Meersschaert: We focus on high quality data that is indicative of commercial intent. Some examples include wishlist interaction, content consumption, and location data. While we have the breadth of a huge swath of the American population, a key feature is that we have no personally identifiable information. We use anonymous unique identifiers.
So, we know this ID did actions indicative of interest in a new SUV, but we don’t know their name, email address, phone number or any other personally identifiable information about a consumer. We feel this is a good balance of commercial need and individual privacy.

Q3. If you had to operate with data from Europe, what would be the impact of the new EU General Data Protection Regulation (GDPR) on your work?

Niels Meersschaert: Europe is a very different market than the U.S. and many of the regulations you mentioned do require a different approach to understanding consumer behaviors. Given that we avoid personal IDs, our approach is already better situated than that of many peers that rely on PII.

Q4. Why did you choose a graph database to implement your consumer behavior tracking system?

Niels Meersschaert: Our graph database is used for ID management. We don’t use it for understanding the intent data, but rather recognizing IDs. Conceptually, describing the various IDs involved is a natural fit for a graph.
As an example, a conceptual consumer could be thought of as the top of the graph. That consumer uses many devices and each device could have 1 or more anonymous IDs associated with it, such as cookie IDs. Each node can represent an associated device or ID and the relationships between each node allow us to see the path. A key element we have in our system is something we call the Borg filter. It’s a bit of a reference to Star Trek, but essentially when we find a consumer is too connected, i.e. has dozens or hundreds of devices, we remove all those IDs from the graph as clearly something has gone wrong. A graph database makes it much easier to determine how many connected nodes are at each level.
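The Borg filter idea can be illustrated with a small Python sketch over an in-memory edge list. This is not Qualia's implementation, which runs against Neo4j. The tuple layout, the function name and the threshold value are assumptions made for illustration; the interview only says "dozens or hundreds" of devices.

```python
from collections import defaultdict

# Hypothetical threshold: the interview does not give an exact cut-off.
MAX_DEVICES = 24


def borg_filter(edges, max_devices=MAX_DEVICES):
    """Drop every edge that belongs to an over-connected consumer.

    edges: iterable of (consumer, device, anonymous_id) tuples, an assumed
    flat representation of the consumer -> device -> ID paths in the graph.
    """
    edges = list(edges)
    devices = defaultdict(set)
    for consumer, device, _ in edges:
        devices[consumer].add(device)
    # A consumer linked to too many devices is treated as a bad grouping.
    borg = {c for c, seen in devices.items() if len(seen) > max_devices}
    return [edge for edge in edges if edge[0] not in borg]
```

Counting distinct devices per consumer is the "how many connected nodes are at each level" question from the answer above; in a graph database it is a single traversal rather than a grouping pass over a flat list.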

Q5. Why did you choose Neo4j?

Niels Meersschaert: Neo4J had a rich query language and very fast performance, especially if your hot set was in RAM.

Q6. You manage one terabyte of graph data in Neo4j. How do you combine them with larger amounts of non-graph data?

Niels Meersschaert: You can think of the graph as a compression system for us. While consumer actions occur on multiple devices and anonymous IDs, they represent the actions of a single consumer. This actually simplifies things for us, since the set of unique grouping IDs is much smaller than the set of unique source IDs. It also allows us to eliminate non-human IDs from the graph. This does mean we see the world in different ways than many peers. As an example, if you focus only on cookie IDs, you tend to have a much larger number of unique IDs than the actual consumers they represent. Sadly, the same thing happens with website monthly uniques: many are highly inflated, both in the number of unique people they represent and because many of the IDs are non-human. Ultimately, the entire goal of advertising is to influence consumers, so we feel that having the better representation of actual consumers allows us to be more effective.
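The "compression" effect can be shown with a short Python sketch that rolls per-ID actions up to consumer IDs. The mapping shape, the sample IDs and the function name are hypothetical; the sketch assumes the identity graph can be exported as a plain source-ID-to-consumer-ID lookup.

```python
from collections import Counter


def count_by_consumer(events, id_to_consumer):
    """Roll up actions recorded per source ID into counts per consumer.

    events: iterable of source IDs (e.g. cookie IDs), one entry per action.
    id_to_consumer: assumed export of the identity graph as a dict.
    Source IDs missing from the mapping (e.g. filtered non-human IDs)
    are dropped.
    """
    counts = Counter()
    for source_id in events:
        consumer = id_to_consumer.get(source_id)
        if consumer is not None:
            counts[consumer] += 1
    return counts


# Hypothetical sample data for illustration only.
mapping = {
    "cookie-a": "consumer-1",
    "cookie-b": "consumer-1",
    "idfa-x": "consumer-2",
}
events = ["cookie-a", "cookie-b", "idfa-x", "bot-9", "cookie-a"]
counts = count_by_consumer(events, mapping)
```

Here four distinct source IDs collapse to two consumers, and the unmapped ID contributes nothing to the totals.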

Q7. What are the technical challenges you face when blending data with different structure?

Niels Meersschaert: A key challenge is finding some unifying element that links data between different systems or structures. What we did with Neo4j is create a unique property on the nodes that we use for interchange. The internal node IDs that are part of Neo4j aren’t something we use except internally within the graph DB.

Q8. If your data is sharded manually, how do you handle scalability?

Niels Meersschaert: We don’t shard the data manually, but scalability is one of the biggest challenges. We’ve spent a lot of time tuning queries and grouping operations to take advantage of some of the capabilities of Neo4J and to work around some limitations it has. The vast majority of graph customers wouldn’t have the volume nor the volatility of data that we do, so our challenges are unique.

Q9. What other technologies do you use and how do they interact with Neo4j?

Niels Meersschaert: We use the classic big data tools like Hadoop and Spark. We also use MongoDB and Google’s BigQuery. If you look at the graph as the truth set of device IDs, we interact with it on ingestion and export only. Everything in the middle can operate on the consumer ID, which is far more efficient.

Q10. How do you measure the ROI of your solution?

Niels Meersschaert: There are a few factors we consider. First, how much does the infrastructure cost us to process the data and output? How fast is it in terms of execution time? How much development effort does it take relative to other solutions? How flexible is it for us to extend it? This is an ever-evolving situation, and we are always looking at how to improve it, especially as a smaller business.

———————————-

Niels Meersschaert
I’ve been coding since I was 7 years old on an Apple II. I’d built radio control model cars and aircraft as a child and built several custom chassis using controlled flex as suspension to keep weight & parts count down. So, I’d had an early interest in both software and physical engineering.

My father was from the Netherlands and my maternal grandfather was a linguist fluent in 43 languages. When I was a kid, my father worked for the airlines, so we traveled often to Europe to see family, and I grew up multilingual. Computer languages are just different ways to describe something; the basic concepts are similar, just as they are in spoken languages, albeit with different grammatical and syntax structure. Whether you’re speaking French, or writing a program in Python or C, the key is you are trying to get your communication across to the target of your message, whether it is another person or a computer.

I originally started university in aeronautical engineering, but in my sophomore year, Grumman let go about 3000 engineers, so I didn’t think the career opportunities would be as great. I’d always viewed problem solutions as a combination of art & science, so I switched majors to one in which I could combine the two.

After school I worked producing and editing commercials and industrials, often with special effects. I got into web video early on & spent a lot of time on compression and distribution systems. That led to working on search, and bringing the linguistics back front and center again. I then combined the two and came full circle back to advertising, but from the technical angle at Magnetic, where we built search retargeting. At Qualia, we kicked this into high gear, where we understand consumer intent by analyzing sentiment, content and actions across multiple devices and environments and the interaction and timing between them to understand the point in the intent path of a consumer.

Resources

EU General Data Protection Regulation (GDPR):

Reform of EU data protection rules

European Commission – Fact Sheet Questions and Answers – Data protection reform

General Data Protection Regulation (Wikipedia)

Neo4j Sandbox: The Neo4j Sandbox enables you to get started with Neo4j, with built-in guides and sample datasets for popular use cases.

Related Posts

LDBC Developer Community: Benchmarking Graph Data Management Systems. ODBMS.org, 6 APR, 2017

Graphalytics benchmark. ODBMS.org, 6 APR, 2017
The Graphalytics benchmark is an industrial-grade benchmark for graph analysis platforms such as Giraph. It consists of six core algorithms, standard datasets, synthetic dataset generators, and reference outputs, enabling the objective comparison of graph analysis platforms.

Collaborative Filtering: Creating the Best Teams Ever. By Maurits van der Goes, Graduate Intern | February 16, 2017

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah
http://www.odbms.org/blog/2017/03/on-the-new-developments-in-apache-spark-and-hadoop-interview-with-amr-awadallah/
Mon, 13 Mar 2017 10:54:21 +0000

“What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).”–Amr Awadallah

I have interviewed Amr Awadallah, Chief Technology Officer at Cloudera.
Main topics of the interview are: the new developments in Apache Spark 2.0 Beta and the Hadoop 3.0.0-alpha1 release; the lessons learned from Amr’s experience of using Hadoop at Yahoo!; and the business problems that the world’s leading organisations have.

RVZ

Q1. Before Cloudera, you served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organisations to use Hadoop for data analysis and business intelligence. What are the main lessons you learned in that period?

Amr Awadallah: A couple of things. First, I learned that Hadoop is capable of solving all the business intelligence problems that I had at Yahoo.
Namely:
(1) our systems weren’t scaling fast enough (we needed to cut down transformation times from hours to minutes),
(2) our systems weren’t economical on a $/TB basis, thus making it hard to retain valuable data for longer time periods, and
(3) we needed new methods to be able to store and analyze semi-structured (e.g. logs) and unstructured data (e.g. social media).
By implementing Hadoop in our team we saw first hand how it can address all these problems. The second lesson that I learned was that Hadoop, back then, was very rough to deploy and program against (it took us many months to deploy it and reprogram our transformations to run on it). It was these lessons that made it clear that there was room for a startup to focus on Hadoop, since (1) it was solving very real data problems that many organizations will face, and (2) it needed a lot of polish to make it work smoothly, securely, and reliably within the enterprise.

Q2. In 2008 you founded Cloudera together with Mike Olson (Oracle), Jeff Hammerbacher (Facebook) and Christophe Bisciglia (Google). What was your main motivation at that time?

Amr Awadallah: Pretty much to do what I describe above: we wanted to make the Hadoop technology easy to use for organizations. That included (1) creating a distribution for Hadoop that bundles all the necessary open-source projects that make it work (we call that CDH, short for Cloudera Distribution for Apache Hadoop), and (2) creating a number of proprietary system management, security, and metadata management tools around CDH to make it easier for organizations to deploy and operate Hadoop in production.

Q3. What are the typical challenging business problems that the world’s leading organisations have?

Amr Awadallah: The technology we provide is very powerful and can be used to solve many problems across many industries, but we see four common themes: The first is simply using Hadoop as a faster, bigger, cheaper system for business intelligence and data analytics. i.e. a lot of organizations just use us to do things they have been doing already, just doing these things in a more economically scalable way.
The second use case is around deeper understanding of customers, i.e. moving away from segmenting all customers into a number of predefined buckets, but rather creating a dynamic micro-segment addressing each customer in a more precise way (thus reducing false positives).
The third use case is about using data to build better products and services, and this use case is catalyzed by the internet-of-things. Due to smart sensors we are able to measure the real world better than ever before; so this use case is about taking all that data and leveraging it to either enhance our current product/service offerings, or build entirely new ones.
The fourth use case is about reducing business risk, and it manifests itself in a number of different sub-cases depending on the industry. For example, cyber-security is one of the key ways to reduce risk, and we have an open source project co-developed with Intel, called Apache Spot, which organizations can use to collect all their network flow data then use Spark machine learning algorithms to detect the anomalies in that data. Anti-money laundering and fraud detection is another way that our banking customers employ our platform to reduce risk within their businesses. Similarly, our insurance industry customers use our system to detect fraudulent claims, etc.
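To make the network-flow idea concrete, here is a minimal sketch of flagging unusual flows. It is not how Apache Spot or Spark's machine learning library does it (the interview does not describe those models); it swaps in a much simpler technique, a modified z-score based on the median absolute deviation. The function name, the threshold of 3.5 and the sample byte counts are all assumptions made for illustration.

```python
from statistics import median


def flag_anomalies(values, threshold=3.5):
    """Return the indexes of values whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a single huge outlier does not distort the baseline.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or most of them) are identical: nothing to compare against.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical bytes transferred per network flow; one flow is far larger.
flow_bytes = [500, 520, 480, 510, 495, 505, 50_000, 515]
suspicious = flag_anomalies(flow_bytes)
```

In this sample only the flow at index 6 is flagged. A production system would look at many more features than byte counts (ports, timing, peers) and would learn its baseline from historical traffic.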

Q4. Can they be solved by analysing data? Can you give us some examples of how the use of advanced analytics drive business decisions?

Amr Awadallah: Yes, all the problems mentioned above can be solved with data. I want to highlight though that this isn’t necessarily about business decisions, which is what the Business Intelligence movement was about (we just help make that cheaper and faster). What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).
One of my favorite examples is a solution that one of our customers built to give voice to premature babies in neonatal intensive care units. They analyze the signals coming from the baby (sounds, blood pressure, heart rate, temperature, a few brain signals), and based on that a message appears on the monitor above the infant showing the nurse whether the baby is hungry, distressed from too much noise or light, etc.
That is really what we mean by using data to create new products and services that weren’t possible before (and not just reports/dashboard).

Q4. Graphs are important. Is it possible to do scalable graph analytics? If yes, how?

Amr Awadallah: Graphs are indeed important, a lot of our customer use-cases trace back to that (not just for social media analytics, but for example anti-money laundering requires analyzing relationships between many financial accounts for detecting bad behaviors, similarly for cyber security applications). I think scalability depends a fair bit on what’s being analyzed and how scalable we mean by scalable. But for most practical purposes I would say Spark’s GraphX is good enough. For example, you can compute PageRank fairly efficiently and scalably on a cluster using GraphX.
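For readers who have not seen PageRank written out, the following is a small single-machine sketch of the power-iteration form of the algorithm. It is plain Python, not GraphX and not distributed; the damping factor of 0.85, the iteration count and the toy account graph are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # Each node splits its rank equally among its out-links.
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical account-to-account transfer graph.
transfers = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(transfers)
top_account = max(ranks, key=ranks.get)
```

Here account `c` ends up with the highest rank because it receives transfers from two accounts. A cluster implementation distributes the same per-edge update across machines, which is the part a framework such as GraphX takes care of.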

Q5. Data security is increasingly important. The risk is due to the growing number of device endpoints. What solutions exist to minimise such risk?

Amr Awadallah: A comprehensive enterprise data security strategy seeks to mitigate the risks presented by a growing number of potentially compromised endpoints connecting to corporate networks. Endpoint security will enable one or all of the following preventative controls:
The first is policy-based enforcement of endpoint security configuration prior to granting an endpoint access to network-based corporate assets. This ensures that any endpoint connected to corporate networks meets minimum requirements for endpoint security configuration.
The second measure is endpoint based anti-malware software (the existence of which may be a policy requirement to connect to the network per the first measure). Anti-malware prevents malicious code from infecting endpoints by monitoring for changes to system configuration and unusual activity or processes.
The third measure is endpoint encryption of corporate data on hard drives, folders and even removable media.
As mentioned above we also collaborate with Intel on Apache Spot, which tracks network flow patterns to detect anomalous communication behavior between different devices (including end point devices). Apache Spot just recently won InfoWorld 2017 Tech of the Year Award. Other advanced analytics security partners we closely work with are: CounterTack, Securonix, Niara, and Jask.

Q6. You recently announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform. How does it work? And how does it differ from the Hadoop-based data platform?

Amr Awadallah: First, at a meta-level, Hadoop (MapReduce specifically) was very good at achieving scalable computation by spreading jobs across many CPU cores and hard disk spindles. That said, MapReduce wasn’t very efficient in how it leveraged memory to optimize the performance of data processing pipelines that have many stages or iterations.
The main power of Spark, that made it take over from MapReduce, was how it truly leveraged memory to achieve better performance in deep or iterative data pipelines. That coupled with a simpler developer API made Spark take over very quickly from MapReduce.
Most of our new customer implementations for data processing or data science tend to be in Spark these days, versus MapReduce.
I should clarify, however, that this doesn’t mean that Hadoop is dead, as some say. Apache Hadoop is comprised of three key subsystems: (1) MapReduce for computation, (2) YARN for resource scheduling, and (3) HDFS for storage. Spark only replaces MapReduce; we still rely heavily on both YARN and HDFS.

That said, the most notable features in Apache Spark 2.0 are:

1) Dataset API: It is a new API that represents the distributed collections of objects processed by Spark’s execution engine. It is an extension of Spark’s Dataframe API. It improves upon the Dataframe API by providing type-safe, object oriented programming interfaces. Users can now write User-Defined Functions and Lambda functions that provide compile time type safety. With the Dataset API, users benefit from optimized operations (like sort, join, hash, etc) in the SparkSQL engine, while also getting compile time type safety for user defined functions.

2) Model & Pipeline Persistence in Spark’s ML library: Machine learning Pipelines built with Spark’s ML library can now be serialized to a file and read back in.
The ability to save and reload these pipelines makes it easy for users to perform version control on the pipelines and safely distribute the pipelines. This helps in operationalizing them in production systems.

3) Structured Streaming: New stream processing API and engine that provides SQL like abstractions for authoring operations on data streams, and also improves performance by using the SparkSQL engine for processing the data streams. However, this is still an experimental API and not ready for production usage yet.

Besides the above 3 notable enhancements, there are a bunch of performance and scalability improvements across the board.

Q7. Apache Impala vs. Amazon Redshift: How Does Redshift Compare to Impala?

Amr Awadallah: Apache Impala is an analytic database engine architecturally designed to perform high-performance highly-concurrent SQL analytics on scalable, open data platforms like Hadoop’s HDFS and Amazon S3.
Impala decouples data storage from compute and lets users query data without having to move/load data specifically into an Impala storage-engine (it doesn’t have one). This architectural difference uniquely enables Impala to deliver a more flexible Business Intelligence experience than traditional database architectures like Redshift (which requires pre-loading the data).

Some of the key benefits of the Impala approach include:

* On-demand resources that are immediately ready to query existing S3 data without loading to a different data silo
* Ability to elastically grow/shrink clusters as needed due to decoupled storage and compute
* More predictable, multi-tenant isolation due to the ability to have multiple Impala clusters sharing a common S3 data repository
* Ability to share common data not only amongst Impala clusters, but also any application that runs on cloud-native S3 storage (for example, you can have both Apache Impala and Apache Spark run against the same data asset in S3, while it isn’t possible to have Apache Spark easily access the data stored in Redshift, it has to go through SQL first).
* Greater flexibility to explore new use cases, analytics, and data by directly querying S3 without rigid traditional data models and ETL

Not only does Impala deliver this additional flexibility, it does so at greater cost-performance and scalability compared to Redshift. See the following benchmark for data on that.

That said, Redshift’s sweet spot is a different target: the smaller data mart. Most Redshift installations are in the dozen-node range, where Redshift’s limitations in scalability, elasticity and flexibility, and its requirement to maintain separate copies of data, are less critical.

Q8. What is Apache Kudu, and why is it relevant for Impala Users?

Amr Awadallah: Historically we had two storage engines in our distribution: (1) HDFS which is optimized for high-throughput analytics, but doesn’t support updates/inserts and (2) HBase which is optimized for low-latency updates/inserts but isn’t good for doing high-throughput queries. To build a proper data warehouse or time-series analytics system, you typically still need to make updates/inserts and that was why we created Apache Kudu.

Kudu is a new storage system that combines the benefits of both HDFS and HBase into one: it allows for low-latency updates/inserts, but also supports high-throughput analytical queries (i.e. fast analytics on fast moving data).
Unlike HDFS, Kudu is not a file-system, it is a record-based system, so the unit of storage is a record as opposed to a file. This allows Kudu to unlock Impala for real-time streaming applications that were not possible with HDFS.
In HDFS the data would only be visible to Impala after we finish closing the file, which typically happens after a large number of records are accumulated (that adds latency between when records are written to when they become visible to the analytical engine). With Kudu as soon as a record is written it is immediately visible to the Impala analytical engine. Finally, just like HDFS and HBase, the Kudu storage engine is fully integrated with our entire stack, not just Impala.
For example, you can also use Apache Spark for machine-learning jobs directly against Kudu.
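The visibility difference described above can be illustrated with a toy model. This is only a sketch of the idea, not the HDFS or Kudu API: the two class names and their methods are hypothetical, and real systems add buffering, replication and durability on top.

```python
class FileBackedTable:
    """Toy model: records become readable only once their file is closed."""

    def __init__(self):
        self._open_file = []
        self._closed_files = []

    def write(self, record):
        self._open_file.append(record)

    def close_file(self):
        self._closed_files.append(self._open_file)
        self._open_file = []

    def scan(self):
        # Readers only ever see records from closed files.
        return [r for f in self._closed_files for r in f]


class RecordBackedTable:
    """Toy model: each record is readable as soon as it is written."""

    def __init__(self):
        self._records = []

    def write(self, record):
        self._records.append(record)

    def scan(self):
        return list(self._records)


file_table = FileBackedTable()
record_table = RecordBackedTable()
for reading in ("sensor-1: 20.5", "sensor-2: 19.8"):
    file_table.write(reading)
    record_table.write(reading)

seen_before_close = file_table.scan()   # nothing visible yet
seen_immediately = record_table.scan()  # both records visible
file_table.close_file()
seen_after_close = file_table.scan()    # visible only now
```

The gap between `write` and `close_file` in the first class is the latency the answer refers to; the second class has no such gap.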

Q9. The Apache Hadoop project recently announced its 3.0.0-alpha1 release. What is it?

Amr Awadallah: HDFS erasure coding is really the main exciting new feature in Hadoop 3. Traditionally HDFS required three replicas, by default, for every data block to achieve durability, concurrent performance, and availability. Using erasure coding techniques, HDFS in Hadoop 3 allows us to significantly reduce the storage overhead from 3x (i.e. 200%) to just 20% extra bits for parity. This allows us to achieve the same durability benefits as 3x replication, but it comes at the cost of potentially lower concurrent performance (when more than one job is trying to access the same block at the same time) and lower availability resilience in the face of top-of-rack switch failures (less of an issue these days).
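The overhead arithmetic behind these figures is easy to check. The sketch below assumes a striped layout with a fixed number of data units and parity units per stripe; the function names are hypothetical, and the (10, 2) layout is only one assumed combination that yields the 20% quoted above. Other layouts give different numbers, for example 6 data units with 3 parity units comes out at 50%.

```python
def replication_overhead(replicas):
    """Extra storage as a fraction of the raw data for N full copies."""
    return float(replicas - 1)


def erasure_overhead(data_units, parity_units):
    """Extra storage as a fraction of the raw data for a striped layout."""
    return parity_units / data_units


def stored_terabytes(raw_tb, overhead):
    """Total bytes on disk for raw_tb of user data at the given overhead."""
    return raw_tb * (1.0 + overhead)


three_way = replication_overhead(3)   # 2.0, i.e. 200% extra
ten_plus_two = erasure_overhead(10, 2)  # 0.2, i.e. 20% extra
six_plus_three = erasure_overhead(6, 3)  # 0.5, i.e. 50% extra

# Storing 100 TB of user data under each scheme.
replicated_tb = stored_terabytes(100, three_way)
erasure_coded_tb = stored_terabytes(100, ten_plus_two)
```

Under these assumptions, 100 TB of user data occupies 300 TB with three-way replication and 120 TB with the (10, 2) layout. The trade-off mentioned in the answer remains: with replication any of three nodes can serve a block directly, while an erasure-coded block has one copy plus parity that must be decoded if that copy is lost.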

Other cool additions are ATS v2 and classpath isolation which you can read more about here

Q10. What is the roadmap ahead for Cloudera Enterprise?

Amr Awadallah: We don’t discuss details of our product roadmap publicly, but there are three guiding themes for us in 2017: The first theme is fast-analytics on fast-moving data (which I covered above in regards to Kudu).
The second theme is cloud, which is making Cloudera Enterprise work better in cloud environments, and make it easier to move workloads (and skill sets) from on-premise clusters to transient cloud clusters in AWS, Azure, and/or Google Cloud.
The third theme is simplifying data-science and machine learning development, especially reducing the time from when a new algorithm is developed to how it can be deployed into production (stay tuned for more on that front).
——————————
Amr Awadallah, Ph.D. Chief Technology Officer, Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds Bachelor’s and Master’s degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.

Resources

Download Page for Apache Spark™

Apache Impala supported by Cloudera Enterprise

DATA-X: Videobook- 8 short videos introduce query analytics for Apache Hadoop

A package that allows R developers to use Hadoop HBase

Book: Big Data Analytics with Spark

Related Posts

Streaming Analytics for Chain Monitoring. By Natalino Busa, Head of Data Science at Teradata — Thursday, ODBMS.org January 12, 2017

Five Challenges to IoT Analytics Success. By Dr. Srinath Perera. ODBMS.org SEPTEMBER 23, 2016

Next-Generation Genomics Analysis with Apache Spark. by Jason Bailey. ODBMS.org Thursday, June 30th, 2016

Supporting the Fast Data Paradigm with Apache Spark BY Stephen Dillon, Data Architect, Schneider Electric. ODBMS.org,23 APR, 2016

– The new series of Q&A with Leading Data Scientists– ODBMS.org:
Part II
Part I


##

]]>
http://www.odbms.org/blog/2017/03/on-the-new-developments-in-apache-spark-and-hadoop-interview-with-amr-awadallah/feed/ 0
Big Data and The Great A.I. Awakening. Interview with Steve Lohr http://www.odbms.org/blog/2016/12/big-data-and-the-great-a-i-awakening-interview-with-steve-lohr/ http://www.odbms.org/blog/2016/12/big-data-and-the-great-a-i-awakening-interview-with-steve-lohr/#comments Mon, 19 Dec 2016 08:35:56 +0000 http://www.odbms.org/blog/?p=4274

“I think we’re just beginning to grapple with implications of data as an economic asset” –Steve Lohr.

My last interview for this year is with Steve Lohr. Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting. We discussed Big Data and how it influences the new Artificial Intelligence awakening.

Wishing you all the best for the Holiday Season and a healthy and prosperous New Year!

RVZ

Q1. Why do you think Google (TensorFlow) and Microsoft (Computational Network Toolkit) are open-sourcing their AI software?

Steve Lohr: Both Google and Microsoft are contributing their tools to expand and enlarge the AI community, which is good for the world and good for their businesses. But I also think the move is a recognition that algorithms are not where their long-term advantage lies. Data is.

Q2. What are the implications of that for both business and policy?

Steve Lohr: The companies with big data pools can have great economic power. Today, that shortlist would include Google, Microsoft, Facebook, Amazon, Apple and Baidu.
I think we’re just beginning to grapple with implications of data as an economic asset. For example, you’re seeing that now with Microsoft’s plan to buy LinkedIn, with its personal profiles and professional connections for more than 400 million people. In the evolving data economy, is that an antitrust issue of concern?

Q3. In this competing world of AI, what is more important, vast data pools, sophisticated algorithms or deep pockets?

Steve Lohr: The best answer to that question, I think, came from a recent conversation with Andrew Ng, a Stanford professor who worked at GoogleX, is co-founder of Coursera and is now chief scientist at Baidu. I asked him why Baidu, and he replied there were only a few places to go to be a leader in A.I. Superior software algorithms, he explained, may give you an advantage for months, but probably no more. Instead, Ng said, you look for companies with two things — lots of capital and lots of data. “No one can replicate your data,” he said. “It’s the defensible barrier, not algorithms.”

Q4. What is the interplay and implications of big data and artificial intelligence?

Steve Lohr: The data revolution has made the recent AI advances possible. We’ve seen big improvements in the last few years, for example, in AI tasks like speech recognition and image recognition, using neural network and deep learning techniques. Those technologies have been around for decades, but they are getting a huge boost from the abundance of training data because of all the web image and voice data that can be tapped now.

Q5. Is data science really only a here-and-now version of AI?

Steve Lohr: No, certainly not only. But I do find that phrase a useful way to explain to most of my readers — intelligent people, but not computer scientists — the interplay between data science and AI. To convey that rudiments of data-driven AI are already all around us. It’s not — surely not yet — robot armies and self-driving cars as fixtures of everyday life. But it is internet search, product recommendations, targeted advertising and elements of personalized medicine, to cite a few examples.

Q6. Technology is moving beyond increasing the odds of making a sale, to being used in higher-stakes decisions like medical diagnosis, loan approvals, hiring and crime prevention. What are the societal implications of this?

Steve Lohr: The new, higher-stakes decisions that data science and AI tools are increasingly being used to make — or assist in making — are fundamentally different than marketing and advertising. In marketing and advertising, a decision that is better on average is plenty good enough. You’ve increased sales and made more money. You don’t really have to know why.
But the other decisions you mentioned are practically and ethically very different. These are crucial decisions about individual people’s lives. Better on average isn’t good enough. For these kinds of decisions, issues of accuracy, fairness and discrimination come into play.
That, I think, argues for two things. First, some sort of auditing tool; the technology has to be able to explain itself, to explain how a data-driven algorithm came to the decision or recommendation that it did.
Second, I think it argues for having a “human in the loop” for most of these kinds of decisions for the foreseeable future.

Q7. Will data analytics move into the mainstream of the economy (far beyond the well known, born-on-the-internet success stories like Google, Facebook and Amazon)?

Steve Lohr: Yes, and I think we’re seeing that now in nearly every field — health care, agriculture, transportation, energy and others. That said, it is still very early. It is a phenomenon that will play out for years, and decades.
Recently, I talked to Jeffrey Immelt, the chief executive of General Electric, America’s largest industrial company. GE is investing heavily to put data-generating sensors on its jet engines, power turbines, medical equipment and other machines — and to hire software engineers and data scientists.
Immelt said if you go back more than a century to the origins of the company, dating back to Thomas Edison‘s days, GE’s technical foundation has been materials science and physics. Data analytics, he said, will be the third fundamental technology for GE in the future.
I think that’s a pretty telling sign of where things are headed.

—————————–
Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years and writes for the Times’ Bits blog. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting.
He was a foreign correspondent for a decade and served as an editor, and has written for national publications such as the New York Times Magazine, the Atlantic, and the Washington Monthly. He is the author of Go To: The Story of the Math Majors, Bridge Players, Engineers, Chess Wizards, Maverick Scientists, Iconoclasts—the Programmers Who Created the Software Revolution and Data-ism The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else.
He lives in New York City.

————————–

Resources

Google (TensorFlow): TensorFlow™ is an open source software library for numerical computation using data flow graphs.

Microsoft (Computational Network Toolkit): A free, easy-to-use, open-source, commercial-grade toolkit that trains deep learning algorithms to learn like the human brain.

Data-ism The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else. by Steve Lohr. 2016 HarperCollins Publishers

Related Posts

Don’t Fear the Robots. By STEVE LOHR. -OCT. 24, 2015-The New York Times, SundayReview | NEWS ANALYSIS

G.E., the 124-Year-Old Software Start-Up. By STEVE LOHR. -AUG. 27, 2016- The New York Times, TECHNOLOGY

Machines of Loving Grace. Interview with John Markoff. ODBMS Industry Watch, Published on 2016-08-11

Recruit Institute of Technology. Interview with Alon Halevy. ODBMS Industry Watch, Published on 2016-04-02

Civility in the Age of Artificial Intelligence, by STEVE LOHR, technology reporter for The New York Times, ODBMS.org

On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.


##

]]>
http://www.odbms.org/blog/2016/12/big-data-and-the-great-a-i-awakening-interview-with-steve-lohr/feed/ 1
New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker http://www.odbms.org/blog/2016/11/new-gartner-magic-quadrant-for-operational-database-management-systems-interview-with-nick-heudecker/ http://www.odbms.org/blog/2016/11/new-gartner-magic-quadrant-for-operational-database-management-systems-interview-with-nick-heudecker/#comments Wed, 30 Nov 2016 20:30:20 +0000 http://www.odbms.org/blog/?p=4272

“It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.”–Nick Heudecker.

I have interviewed Nick Heudecker, Research Director on Gartner’s Data & Analytics team.
The main topic of the interview is the new Magic Quadrant for Operational Database Management Systems.

RVZ

Q1. You have published the new Magic Quadrant for Operational Database Management Systems (*). How do you define the operational database management system market?

Nick Heudecker: We define a DBMS as a complete software system used to define, create, manage, update and query a database. DBMSs provide interfaces to independent programs and tools that both support and govern the performance of a variety of concurrent workload types. There is no presupposition that DBMSs must support the relational model or that they must support the full set of possible data types in use today. OPDBMSs must include functionality to support backup and recovery, and have some form of transaction durability — although the atomicity, consistency, isolation and durability model is not a requirement. OPDBMSs may support multiple delivery models, such as stand-alone DBMS software, certified configurations, cloud (public and private) images or versions, and database appliances.

Q2. Can you explain the methodology you used for this new Magic Quadrant?

Nick Heudecker: Several Gartner methodologies are public. The Magic Quadrant methodology can be found here.

We use a number of data sources when we’re creating the Magic Quadrant for Operational Database Management Systems.
We survey vendor reference customers and include data from our interactions with Gartner clients. We also consider earlier information and any news about vendors’ products, customers and finances that came to light during the time frame for our analysis.

Once we have the data, we score vendors across the various dimensions of Completeness of Vision and Ability to Execute.
One thing that’s important to note is Magic Quadrants are relative assessments of vendors in a market. We couldn’t have one vendor on an MQ because it would be right in the middle – there’s nothing to compare it to.

Q3. Why were there no Visionaries this year?

Nick Heudecker: We determined there was an overall lack of vision in the market. After a few years of rapid feature expansion, the focus has shifted to operational excellence and execution. Even Leaders shifted to the left on vision, but are still placed in the Leaders quadrant based on their vision for the development of hybrid database management, hardware optimization and integration, emerging deployment models such as containerization, as well as vertical features.

Q4. Were you surprised by the analysis and some of the results you obtained?

Nick Heudecker: The lack of overall vision in the market struck us the most. Other than in a few notable cases, we received largely the same story from most vendors. The explosion of features, and the vendors emerging to implement them, has slowed. The features that initiated the expansion, such as storing new data types, geographically distributed storage, cloud and flexible data consistency models, have become common. Today, nearly every established or emerging DBMS vendor supports these features to some degree. The OPDBMS market has shifted from a phase of rapid innovation to a phase of maturing products and capabilities.

Q5. Do you believe the “NoSQL” label will continue to distinguish DBMSs?

Nick Heudecker: If you look at the entire operational DBMS space, there’s already a great deal of convergence between NoSQL vendors, as well as between NoSQL and traditionally relational vendors. Nearly every vendor, nonrelational and relational, supports multiple data types, like JSON documents, graph or wide-column. NoSQL vendors are adding SQL: MongoDB’s BI Connector and Couchbase’s N1QL are good, if diverse, examples. They’re also adding things like schema management and data validation capabilities.
On the relational side, they’re adding horizontal scaling options and alternative consistency models, as well as modern APIs. And everyone either has or is adding in-memory and cloud capabilities.

It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.

Q6. What are the other “Vendors to Consider”?

Nick Heudecker: The other vendors to consider are vendors that did not meet the inclusion requirements for the Magic Quadrant. Usually this is because they missed our minimum revenue requirements, but that doesn’t mean they don’t have compelling products.

——————————-
Nick Heudecker is a Research Director on Gartner’s Data & Analytics team. His coverage includes data management technologies and practices.

——————————-

Resources
(*) Magic Quadrant for Operational Database Management Systems. Published: 05 October 2016. ID: G00293203. Analyst(s): Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald

– Complimentary Gartner Research: 100 Data and Analytics Predictions Through 2020. Get exclusive access to Gartner’s top 100 data and analytics predictions through 2020. Plus access other relevant Gartner research including Magic Quadrant reports for database and data warehouse solutions, and the market guide for in-memory computing (LINK to MemSQL web site – registration required).

Related Posts

MarkLogic Named a Next-Generation Database Challenger in 2016 Gartner Magic Quadrant. By GARY BLOOM, Chief Executive Officer and President MARKLOGIC

MarkLogic Recognized in New Gartner® Magic Quadrant. Gartner Magic Quadrant for Operational Database Management Systems positions MarkLogic® the highest for ability to execute in the Challengers Quadrant

– Accelerating Business Value with a Multi-Model, Multi-Workload Data Platform

– NuoDB Recognized by Gartner in Critical Capabilities for Operational Database Management Systems. Elastic SQL database achieves top five score in all four use cases.

– Clustrix Recognized in Gartner Magic Quadrant for Operational Database Management Systems

– Learn why EDB is named a “Challenger” in the 2016 Gartner ODBMS Magic Quadrant

– DataStax Receives Highest Scores in 2 Use Cases in Gartner’s Critical Capabilities for Operational Database Management Systems

– Gartner Scores Oracle Highest In 3 of 4 Use Cases: Gartner Critical Capabilities for Operational Database Management Systems Report

Gartner Critical Capabilities For Operational Database Management Systems 2016 – Redis Labs Ranked Second Highest In 2/4 Categories (Link: registration required)

 


##

]]>
http://www.odbms.org/blog/2016/11/new-gartner-magic-quadrant-for-operational-database-management-systems-interview-with-nick-heudecker/feed/ 0
Database Challenges and Innovations. Interview with Jim Starkey http://www.odbms.org/blog/2016/08/database-challenges-and-innovations-interview-with-jim-starkey/ http://www.odbms.org/blog/2016/08/database-challenges-and-innovations-interview-with-jim-starkey/#comments Wed, 31 Aug 2016 03:33:42 +0000 http://www.odbms.org/blog/?p=4218

“Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size?” –Jim Starkey.

I have interviewed Jim Starkey. A database legend, Jim has a career as an entrepreneur, architect, and innovator that spans more than three decades of database history.

RVZ

Q1. In your opinion, what are the most significant advances in databases in the last few years?

Jim Starkey: I’d have to say the “atom programming model” where a database is layered on a substrate of peer-to-peer replicating distributed objects rather than disk files. The atom programming model enables scalability, redundancy, high availability, and distribution not available in traditional, disk-based database architectures.

Q2. What was your original motivation to invent the NuoDB Emergent Architecture?

Jim Starkey: It all grew out of a long Sunday morning shower. I knew that the performance limits of single-computer database systems were in sight, so distributing the load was the only possible solution, but existing distributed systems required that a new node copy a complete database or partition before it could do useful work. I started thinking of ways to attack this problem and came up with the idea of peer to peer replicating distributed objects that could be serialized for network delivery and persisted to disk. It was a pretty neat idea. I came out much later with the core architecture nearly complete and very wrinkled (we have an awesome domestic hot water system).

Q3. In your career as an entrepreneur and architect what was the most significant innovation you did?

Jim Starkey: Oh, clearly multi-generational concurrency control (MVCC). The problem I was trying to solve was allowing ad hoc access to a production database for a 4GL product I was working on at the time, but the ramifications go far beyond that. MVCC is the core technology that makes true distributed database systems possible. Transaction serialization is like Newtonian physics – all observers share a single universal reference frame. MVCC is like special relativity, where each observer views the universe from his or her reference frame. The views appear different but are, in fact, consistent.
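The idea that each observer sees a different but consistent view can be illustrated with a minimal sketch. This is not Starkey’s implementation or any product’s: the `Store`, `Transaction` and `Version` names are hypothetical, and the sketch assumes a single process and omits write-write conflict detection and garbage collection of old versions.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Version:
    value: object
    created_by: int  # id of the transaction that wrote this version


@dataclass
class Store:
    versions: dict = field(default_factory=dict)  # key -> list of Version, oldest first
    committed: set = field(default_factory=set)   # ids of committed transactions
    _ids: count = field(default_factory=lambda: count(1))

    def begin(self):
        # A transaction's snapshot is the set of transactions committed when it starts.
        return Transaction(self, next(self._ids), frozenset(self.committed))


@dataclass
class Transaction:
    store: Store
    txid: int
    snapshot: frozenset

    def read(self, key):
        # Walk versions newest-first and return the first one visible to this transaction.
        for version in reversed(self.store.versions.get(key, [])):
            if version.created_by == self.txid or version.created_by in self.snapshot:
                return version.value
        return None

    def write(self, key, value):
        # Writers append a new version instead of overwriting, so readers are never blocked.
        self.store.versions.setdefault(key, []).append(Version(value, self.txid))

    def commit(self):
        self.store.committed.add(self.txid)


store = Store()
setup = store.begin()
setup.write("balance", 100)
setup.commit()

reader = store.begin()   # snapshot taken before the update below
writer = store.begin()
writer.write("balance", 150)
writer.commit()
later = store.begin()    # snapshot taken after the update

print(reader.read("balance"))  # 100
print(later.read("balance"))   # 150
```

The two readers disagree about the current balance, yet each sees a state that actually existed at the moment its transaction began.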

Q4. Proprietary vs. open source software: what are the pros and cons?

Jim Starkey: It’s complicated. I’ve had feet in both camps for 15 years. But let’s draw a distinction between open source and open development. Open development – where anyone can contribute – is pretty good at delivering implementations of established technologies, but it’s very difficult to push the state of the art in that environment. Innovation, in my experience, requires focus, vision, and consistency that are hard to maintain in open development. If you have a controlled development environment, the question of open source versus propriety is tactics, not philosophy. Yes, there’s an argument that having the source available gives users guarantees they don’t get from proprietary software, but with something as complicated as a database, most users aren’t going to try to master the sources. But having source available lowers the perceived risk of new technologies, which is a big plus.

Q5. You led the Falcon project – a transactional storage engine for the MySQL server – through the acquisition of MySQL by Sun Microsystems. What impact did this project have in the database space?

Jim Starkey: In all honesty, I’d have to say that Falcon’s most important contribution was its competition with InnoDB. In the end, that competition made InnoDB three times faster. Falcon, multi-version in memory using the disk for backfill, was interesting, but no matter how we cut it, it was limited by the performance of the machine it ran on. It was fast, but no single node database can be fast enough.

Q6. What are the most challenging issues in databases right now?

Jim Starkey: I think it’s time to step back and reexamine the assumptions that have accreted around database technology – data model, API, access language, data semantics, and implementation architectures. The “relational model”, for example, is based on what Codd called relations and we call tables, but otherwise has nothing to do with his mathematical model. That model, based on set theory, requires automatic duplicate elimination. To the best of my knowledge, nobody ever implemented Codd’s model, but we still have tables which bear a scary resemblance to decks of punch cards. Are they necessary? Or do they just get in the way?
Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size? SQL has no provision for flexible text search, no provision for multi-column, multi-table search, and no mechanics in the APIs to handle the results if it could do them. And this is just one of a dozen problems that SQL databases can’t handle. It was a really good technical fit for computers, memory, and disks of the 1980’s, but is it the right answer now?
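The point about duplicate elimination can be seen in a small example. The sketch below uses Python’s built-in sqlite3 module with a hypothetical `orders` table; it only illustrates that SQL tables behave as bags (multisets) unless `DISTINCT` is requested, whereas a relation in the set-theoretic sense cannot hold the same tuple twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Ann", "Boston"), ("Bob", "Boston"), ("Cy", "Austin")],
)

# Projecting onto one column: SQL keeps the duplicate rows (bag semantics).
bag = [row[0] for row in conn.execute("SELECT city FROM orders ORDER BY city")]

# Set semantics has to be requested explicitly with DISTINCT.
as_set = [
    row[0]
    for row in conn.execute("SELECT DISTINCT city FROM orders ORDER BY city")
]

print(bag)     # ['Austin', 'Boston', 'Boston']
print(as_set)  # ['Austin', 'Boston']
conn.close()
```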

Q7. How do you see the database market evolving?

Jim Starkey: I’m afraid my crystal ball isn’t that good. Blobs, another of my creations, spread throughout the industry in two years. MVCC took 25 years to become ubiquitous. I have a good idea of where I think it should go, but little expectation of how or when it will.

Qx. Anything else you wish to add?

Jim Starkey: Let me say a few things about my current project, AmorphousDB, an implementation of the Amorphous Data Model (meaning, no data model at all). AmorphousDB is my modest effort to question everything database.
The best way to think about Amorphous is to envision a relational database and mentally erase the boxes around the tables so all records free float in the same space – including data and metadata. Then, if you’re uncomfortable, add back a “record type” attribute and associated syntactic sugar, so table-type semantics are available, but optional. Then abandon punch card data semantics and view all data as abstract and subject to search. Eliminate the fourteen different types of numbers and strings, leaving simply numbers and strings, but add useful types like URL’s, email addresses, and money. Index everything unless told not to. Finally, imagine an API that fits on a single sheet of paper (OK, 9 point font, both sides) and an implementation that can span hundreds of nodes. That’s AmorphousDB.

————
Jim Starkey invented the NuoDB Emergent Architecture, and developed the initial implementation of the product. He founded NuoDB [formerly NimbusDB] in 2008, and retired at the end of 2012, shortly before the NuoDB product launch.

Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history from the Datacomputer project on the fledgling ARPAnet to his most recent startup, NuoDB, Inc. Through the period, he has been responsible for many database innovations from the date data type to the BLOB to multi-version concurrency control (MVCC). Starkey has extensive experience in proprietary and open source software.

Starkey joined Digital Equipment Corporation in 1975, where he created the Datatrieve family of products, the DEC Standard Relational Interface architecture, and the first of the Rdb products, Rdb/ELN. Starkey was also software architect for DEC’s database machine group.

Leaving DEC in 1984, Starkey founded Interbase Software to develop relational database software for the engineering workstation market. Interbase was a technical leader in the database industry producing the first commercial implementations of heterogeneous networking, blobs, triggers, two phase commit, database events, etc. Ashton-Tate acquired Interbase Software in 1991, and was, in turn, acquired by Borland International a few months later. The Interbase database engine was released open source by Borland in 2000 and became the basis for the Firebird open source database project.

In 2000, Starkey founded Netfrastructure, Inc., to build a unified platform for distributable, high quality Web applications. The Netfrastructure platform included a relational database engine, an integrated search engine, an integrated Java virtual machine, and a high performance page generator.

MySQL AB acquired Netfrastructure, Inc. in 2006 to be the kernel of a wholly owned transactional storage engine for the MySQL server, later known as Falcon. Starkey led the Falcon project through the acquisition of MySQL by Sun Microsystems.

Jim has a degree in Mathematics from the University of Wisconsin.
For amusement, Jim codes on weekends, while sailing, but not while flying his plane.

——————

Resources

NuoDB Emergent Architecture (.PDF)

On Database Resilience. Interview with Seth Proctor, ODBMS Industry Watch, March 17, 2015

Related Posts

– Challenges and Opportunities of The Internet of Things. Interview with Steve Cellini, ODBMS Industry Watch, October 7, 2015

– Hands-On with NuoDB and Docker, BY MJ Michaels, NuoDB. ODBMS.org– OCT 27 2015

– How leading Operational DBMSs rank popularity wise? By Michael Waclawiczek– ODBMS.org · JANUARY 27, 2016

– A Glimpse into U-SQL BY Stephen Dillon, Schneider Electric, ODBMS.org-DECEMBER 7, 2015

– Gartner Magic Quadrant for Operational DBMS 2015

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2016/08/database-challenges-and-innovations-interview-with-jim-starkey/feed/ 0
Machines of Loving Grace. Interview with John Markoff. http://www.odbms.org/blog/2016/08/machines-of-loving-grace-interview-with-john-markoff/ http://www.odbms.org/blog/2016/08/machines-of-loving-grace-interview-with-john-markoff/#comments Thu, 11 Aug 2016 19:13:46 +0000 http://www.odbms.org/blog/?p=4190

“Intelligent system designers do have ethical responsibilities.”
–John Markoff.

I have interviewed John Markoff, technology writer at The New York Times. 
In 2013 he was awarded a Pulitzer Prize.
The interview is related to his recent book “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots”, published in August of 2015 by HarperCollins Ecco.

RVZ

Q1. Do you share the concerns of prominent technology leaders such as Tesla’s chief executive, Elon Musk, who suggested we might need to regulate the development of artificial intelligence?

John Markoff: I share their concerns, but not their assertions that we may be on the cusp of some kind of singularity or rapid advance to artificial general intelligence. I do think that machine autonomy raises specific ethical and safety concerns and regulation is an obvious response.

Q2. How difficult is it to reconcile the different interests of the people who are involved in a direct or indirect way in developing and deploying new technology?

John Markoff: This is why we have governments and governmental regulation. I think AI, in that respect, is no different from any other technology. It should and can be regulated when human safety is at stake.

Q3. In your book Machines of Loving Grace you argued that “we must decide to design ourselves into our future, or risk being excluded from it altogether”. What do you mean by that?

John Markoff: You can use AI technologies either to automate or to augment humans. The problem is minimized when you take an approach that is based on human centric design principles.

Q4. How is it possible in practice? Isn’t the technology space dominated by giants such as IBM, Apple, Google who dictate the direction of new technology?

John Markoff:  This is a very interesting time with “giant” technology companies realizing that there are consequences in the deployment of these technologies. Google, IBM and Microsoft have all recently made public commitments to the safe use of AI.

Q5. What are the most significant new developments in the humans-computers area, that are likely to have a significant influence in our daily life in the near future?

John Markoff:  One of the best things about being a reporter is that you don’t have to predict the future. You only have to note what the various visionaries say, so you can call that to their attention when their predictions prove inaccurate. With that caveat, if I am forced to bet on any particular information technology it would be augmented reality. This is because I believe that multi-touch interfaces for mobile devices simply can’t be the last step in user interface.

Q6. Do you believe that robots will really transform modern life?

John Markoff: I struggle with the definition of what is a “robot.” If something is tele-operated, for example, is it a robot? That said, I think that we will increasingly be surrounded by machines that perform tasks.
The question is whether they will come as quickly as Silicon Valley seems to believe. My friend Paul Saffo has said, “Never mistake a clear view for a short distance.” And I think that is the case with all kinds of mobile robots, including self-driving cars.

Q7. For the designers of Intelligent Systems, how difficult is it to draw a line between what is human and what is machine?

John Markoff: I feel strongly that the possibility of designing cyborgs, particularly with respect to intellectual prosthesis, is a boundary we should cross with great caution. Remember the Borg from Star Trek. “Resistance is futile, you will be assimilated.” I think the challenge is to use these systems to enhance human thought, not for social control.

Q8. What are the ethical responsibilities of designers of intelligent systems?

John Markoff: I think the most important aspect of that question is the simple acknowledgement that intelligent system designers do have ethical responsibilities. That has not always been the case, but it seems to be a growing force within the community of AI and robotics designers in the past five years, so I’m not entirely pessimistic.

Q9. If humans delegate decisions to machines, who will be responsible for the consequences?

John Markoff: Ben Shneiderman, the University of Maryland computer scientist and user interface designer has written eloquently on this point. Indeed he argues against autonomous systems for precisely this reason. His point is that it is essential to keep a human in the loop. If not you run the risk of abdicating ethical responsibility for system design.

Q10. Assuming there is real potential in using data-driven methods both to help charities develop better services and products and to understand civil society activity, what, in your opinion, are the key lessons and recommendations for future work in this space?

John Markoff: I’m afraid I’m not an expert in the IT needs of either charities or NGOs. That said, a wide range of AI advances are already being delivered at nominal cost via smart phones. As cheap sensors proliferate, virtually all everyday objects will gain intelligence that will be widely accessible.

Qx. Anything else you wish to add?

John Markoff: Only that I think it is interesting that the augmentation vs automation dichotomy is increasingly seen as a path through which to navigate the impact of these technologies. Computer system designers are the ones who will decide what the impact of these technologies is and whether to replace or augment humans in society.

—————————————-

JOHN GREGORY MARKOFF

John Markoff joined The New York Times in March 1988 as a reporter for the business section. He is now a technology writer based in the San Francisco bureau of the paper. Prior to joining the Times, he worked for The San Francisco Examiner from 1985 to 1988. He reported for the New York Times Science Section from 2010 to 2015.

Markoff has written about technology and science since 1977. He covered technology and the defense industry for The Pacific News Service in San Francisco from 1977 to 1981; he was a reporter at Infoworld from 1981 to 1983; he was the West Coast editor for Byte Magazine from 1984 to 1985 and wrote a column on personal computers for The San Jose Mercury from 1983 to 1985.

He has also been a lecturer at the University of California at Berkeley School of Journalism and an adjunct faculty member of the Stanford Graduate Program on Journalism.

The Times nominated him for a Pulitzer Prize in 1995, 1998 and 2000. The San Francisco Examiner nominated him for a Pulitzer in 1987. In 2005, with a group of Times reporters, he received the Loeb Award for business journalism. In 2007 he shared the Society of American Business Editors and Writers Breaking News award. In 2013 he was awarded a Pulitzer Prize in explanatory reporting as part of a New York Times project on labor and automation.

In 2007 he became a member of the International Media Council at the World Economic Forum. Also in 2007, he was named a fellow of the Society of Professional Journalists, the organization’s highest honor.

In June of 2010 the New York Times presented him with the Nathaniel Nash Award, which is given annually for foreign and business reporting.

Born in Oakland, California on October 29, 1949, Markoff grew up in Palo Alto, California and graduated from Whitman College, Walla Walla, Washington, in 1971. He attended graduate school at the University of Oregon and received a master’s degree in sociology in 1976.

Markoff is the co-author of “The High Cost of High Tech,” published in 1985 by Harper & Row. He wrote “Cyberpunk: Outlaws and Hackers on the Computer Frontier” with Katie Hafner, which was published in 1991 by Simon & Schuster.
In January of 1996 Hyperion published “Takedown: The Pursuit and Capture of America’s Most Wanted Computer Outlaw,” which he co-authored with Tsutomu Shimomura. “What the Dormouse Said: How the Sixties Counterculture shaped the Personal Computer Industry,” was published in 2005 by Viking Books. “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots,” was published in August of 2015 by HarperCollins Ecco.

He is currently researching a biography of Stewart Brand.

He is married to Leslie Terzian Markoff and they live in San Francisco, Calif.

Resources

MACHINES OF LOVING GRACE – The Quest for Common Ground Between Humans and Robots By John Markoff, Illustrated. 378 pp. Ecco/HarperCollins Publishers.

Shneiderman’s “Eight Golden Rules of Interface Design”. These rules were obtained from the text Designing the User Interface by Ben Shneiderman.

“Designing the User Interface”, 6th Edition. This is a revised edition of the highly successful textbook on Human Computer Interaction originally developed by Ben Shneiderman and Catherine Plaisant at the University of Maryland.

Related Posts

– Recruit Institute of Technology. Interview with Alon Halevy ODBMS Industry Watch, Published on 2016-04-02

– Civility in the Age of Artificial Intelligence,  by STEVE LOHR, technology reporter for The New York Times, ODBMS.org

– On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

– On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.

Follow us on Twitter: @odbmsorg

# #

]]>
http://www.odbms.org/blog/2016/08/machines-of-loving-grace-interview-with-john-markoff/feed/ 3
Recruit Institute of Technology. Interview with Alon Halevy http://www.odbms.org/blog/2016/04/recruit-institute-of-technology-interview-with-alon-halevy/ http://www.odbms.org/blog/2016/04/recruit-institute-of-technology-interview-with-alon-halevy/#comments Sat, 02 Apr 2016 15:10:02 +0000 http://www.odbms.org/blog/?p=4112

“A revolution will happen when tools like Siri can truly serve as your personal assistant and you start relying on such an assistant throughout your day. To get there, these systems need more knowledge about your life and preferences, more knowledge about the world, better conversational interfaces and at least basic commonsense reasoning capabilities. We’re still quite far from achieving these goals.” –Alon Halevy

I have interviewed Alon Halevy, CEO at Recruit Institute of Technology.

RVZ

Q1. What is the mission of the Recruit Institute of Technology?

Alon Halevy: Before I describe the mission, I should introduce our parent company Recruit Holdings to those who may not be familiar with it. Recruit (founded in 1960) is a leading “life-style” information services and human resources company in Japan with services in the areas of recruitment, advertising, employment placement, staffing, education, housing and real estate, bridal, travel, dining, beauty, automobiles and others. The company is currently expanding worldwide and operates similar businesses in the U.S., Europe and Asia. In terms of size, Recruit has over 30,000 employees and its revenues are similar to those of Facebook at this point in time.

The mission of R.I.T is threefold. First, being the lab of Recruit Holdings, our goal is to develop technologies that improve the products and services of our subsidiary companies and create value for our customers from the vast collections of data we have. Second, our mission is to advance scientific knowledge by contributing to the research community through publications in top-notch venues. Third, we strive to use technology for social good. This latter goal may be achieved through contributing to open-source software, working on digital artifacts that would be of general use to society, or even working with experts in a particular domain to contribute to a cause.

Q2. Isn’t this similar to the mission of the Allen Institute for Artificial Intelligence?

Alon Halevy: The Allen Institute is a non-profit whose admirable goal is to make fundamental contributions to Artificial Intelligence. While R.I.T strives to make fundamental contributions to A.I and related areas such as data management, we plan to work closely with our subsidiary companies and to impact the world through their products.

Q3. Driverless cars, digital Personal Assistants (e.g. Siri), Big Data, the Internet of Things, Robots: Are we on the brink of the next stage of the computer revolution?

Alon Halevy: I think we are seeing many applications in which AI and data (big or small) are starting to make a real difference and affecting people’s lives. We will see much more of it in the next few years as we refine our techniques. A revolution will happen when tools like Siri can truly serve as your personal assistant and you start relying on such an assistant throughout your day. To get there, these systems need more knowledge about your life and preferences, more knowledge about the world, better conversational interfaces and at least basic commonsense reasoning capabilities. We’re still quite far from achieving these goals.

Q4. You were for more than 10 years senior staff research scientist at Google, leading the Structured Data Group in Google Research. Was it difficult to leave Google?

Alon Halevy: It was extremely difficult leaving Google! I struggled with the decision for quite a while, and waving goodbye to my amazing team on my last day was emotionally heart-wrenching. Google is an amazing company and I learned so much from my colleagues there. Fortunately, I’m very excited about my new colleagues and the entrepreneurial spirit of Recruit.
One of my goals at R.I.T is to build a lab with the same culture as that of Google and Google Research. So in a sense, I’m hoping to take Google with me. Some of my experiences from a decade at Google that are relevant to building a successful research lab are described in a blog post I contributed to the SIGMOD blog in September, 2015.

Q5. What is your vision for the next three years for the Recruit Institute of Technology?

Alon Halevy: I want to build a vibrant lab with world-class researchers and engineers. I would like the lab to become a world leader in the broad area of making data usable, which includes data discovery, cleaning, integration, visualization and analysis.
In addition, I would like the lab to build collaborations with disciplines outside of Computer Science where computing techniques can make an even broader impact on society.

Q6. What are the most important research topics you intend to work on?

Alon Halevy: One of the roadblocks to applying AI and analysis techniques more widely within enterprises is data preparation.
Before you can analyze data or apply AI techniques to it, you need to be able to discover which datasets exist in the enterprise, understand the semantics of a dataset and its underlying assumptions, and to combine disparate datasets as needed. We plan to work on the full spectrum of these challenges with the goal of enabling many more people in the enterprise to explore their data.

Recruit being a lifestyle company, another fundamental question we plan to investigate is whether technology can help people make better life decisions. In particular, can technology help you take into consideration many factors in your life as you make decisions and steer you towards decisions that will make you happier over time? Clearly, we’ll need more than computer scientists to even ask the right questions here.

Q7. If we delegate decisions to machines, who will be responsible for the consequences? What are the ethical responsibilities of designers of intelligent systems?

Alon Halevy: You got an excellent answer from Oren Etzioni to this question in a recent interview. I agree with him fully and could not say it any better than he did.

Qx. Anything you wish to add?

Alon Halevy: Yes. We’re hiring! If you’re a researcher or strong engineer who wants to make real impact on products and services in the fascinating area of lifestyle events and decision making, please consider R.I.T!

———-

Alon Halevy is the Executive Director of the Recruit Institute of Technology. From 2005 to 2015 he headed the Structured Data Management Research group at Google. Prior to that, he was a professor of Computer Science at the University of Washington in Seattle, where he founded the Database Group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space, and in 2004, Dr. Halevy founded Transformic, a company that created search engines for the deep web, and was acquired by Google.
Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). Halevy is the author of the book “The Infinite Emotions of Coffee”, published in 2011, and serves on the board of the Alliance of Coffee Excellence.
He is also a co-author of the book “Principles of Data Integration”, published in 2012.
Dr. Halevy received his Ph.D. in Computer Science from Stanford University in 1993 and his Bachelor’s from the Hebrew University in Jerusalem.

Resources

– Civility in the Age of Artificial Intelligence,  by STEVE LOHR, technology reporter for The New York Times, ODBMS.org

The threat from AI is real, but everyone has it wrong, by Robert Munro, CEO Idibon, ODBMS.org

Related Posts

On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

– On Big Data and Society. Interview with Viktor Mayer-Schönberger ODBMS Industry Watch.

Follow us on Twitter: @odbmsorg

##

]]>
http://www.odbms.org/blog/2016/04/recruit-institute-of-technology-interview-with-alon-halevy/feed/ 0
On Dark Data. Interview with Gideon Goldin http://www.odbms.org/blog/2015/11/on-dark-data-interview-with-gideon-goldin/ http://www.odbms.org/blog/2015/11/on-dark-data-interview-with-gideon-goldin/#comments Mon, 16 Nov 2015 12:19:11 +0000 http://www.odbms.org/blog/?p=4023

“Top-down cataloging and master-data management tools typically require expensive data curators, and are not simple to use. This poses a significant threat to cataloging efforts since so much knowledge about your organization’s data is inevitably clustered across the minds of the people who need to question it and the applications they use to answer those questions.” –Gideon Goldin

I have interviewed Gideon Goldin, UX Architect, Product Manager at Tamr.

RVZ

Q1. What is “dark data”?

Gideon Goldin: Gartner refers to dark data as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” For most organizations, dark data comprises the majority of available data, and it is often the result of the constantly changing and unpredictable nature of enterprise data, something that is likely to be exacerbated by corporate restructuring, M&A activity, and a number of external factors.

By shedding light on this data, organizations are better suited to make more data-driven, accurate business decisions.
Tamr Catalog, which is available as a free downloadable app, aims to do this, providing users with a view of their entire data landscape so they can quickly understand what was in the dark and why.

Q2. What are the main drawbacks of traditional top-down methods of cataloging or “master data management”?

Gideon Goldin: The main drawbacks are scalability and simplicity. When Yahoo, for example, started to catalog the web they employed some top-down approaches, hiring specialists to curate structured directories of information. As the web grew, however, their solution became less relevant and significantly more costly. Google, on the other hand, mined the web to understand references that exist between pages, allowing the relevance of sites to emerge from the bottom up. As a result, Google’s search engine was more accurate, easier to scale, and simpler.

Top-down cataloging and master-data management tools typically require expensive data curators, and are not simple to use. This poses a significant threat to cataloging efforts since so much knowledge about your organization’s data is inevitably clustered across the minds of the people who need to question it and the applications they use to answer those questions. Tamr Catalog aims to deliver an innovative and vastly simplified method for cataloging your organization’s data.
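The bottom-up idea in this answer, relevance emerging from the references between pages rather than from a curated directory, can be sketched with a simplified PageRank-style iteration. This is an illustration only, not Google’s actual algorithm and not anything used by Tamr Catalog; the `link_rank` function, the damping value and the tiny `web` graph are assumptions made for the example.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages from a dict {page: [pages it links to]} by who links to whom."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly among the pages it references.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages reference "hub", which references nothing.
web = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": []}
scores = link_rank(web)
best = max(scores, key=scores.get)
print(best)  # hub
```

No curator had to label "hub" as important; its score follows from the references other pages make to it.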

Q3. Tamr recently opened a public Beta program, Tamr Catalog, for an enterprise metadata catalog. What is it?

Gideon Goldin: The Tamr Catalog Beta Program is an open invitation to test-drive our free cataloging software. We have yet to find an organization that is content with their current cataloging approaches, and we found that the biggest barrier to reform is often knowing where to start. Catalog can help: the goal of the Catalog Beta Program is to better understand how people want and need to collaborate around their data sources. We believe that an early partnership with the community will ensure that we develop useful functionality and thoughtful design.

Q4. What is the core functionality of Tamr Catalog?

Gideon Goldin: Tamr Catalog enables users to easily register, discover and organize their data assets.

Q5. How does it help simplify access to high-quality data sets for analytics?

Gideon Goldin: Not surprisingly, people are biased to use the data sets closest to them. With Catalog, scientists and analysts can easily discover unfamiliar data sets: data sets, for example, that may belong to other departments or analysts. Catalog profiles and collects pointers to your sources, providing multifaceted and visual browsing of all data and trivializing the search for any given set of data.

Q6. How does Tamr Catalog relate to the Tamr Data Unification Platform?

Gideon Goldin: Before organizations can unify their data, preparing it for improved analysis or management, they need to know what they have. Organizations often lack a good approach for this first (and repeating) step in data unification. We realized this quickly when helping large organizations begin their unification projects, and we even realized we lacked a satisfactory tool to understand our own data. Thus, we built Catalog as a part of the Tamr Data Unification Platform to illuminate your data landscape, such that people can be confident that their unification efforts are as comprehensive as possible.

Q7. What are the main challenges (technical and non-technical) in achieving broad adoption of vendor- and platform-neutral metadata cataloging?

Gideon Goldin: Often the challenge isn’t about volume, it’s about variety. While a vendor-neutral Catalog intends to solve exactly this, there remains a technical challenge in providing a flexible and elegant interface for cataloging dozens or hundreds of different types of data sets and the structures they comprise.

However, we find that some of the biggest (and most interesting) challenges revolve around organizational processes and culture. Some organizations have developed sophisticated but unsustainable approaches to managing their data, while others have become paralyzed by the inherently disorganized nature of their data. It can be difficult to appreciate the value of investing in these problems. Figuring out where to start, however, shouldn’t be difficult. This is why we chose to release a lightweight application free of charge.

Q8. Chief Data Officers (CDOs), data architects and business analysts have different requirements and different modes of collaborating on (shared) data sets. How do you address this in your catalog?

Gideon Goldin: The goal of cataloging isn’t cataloging; it’s helping CDOs identify business opportunities, empowering architects to improve infrastructures, enabling analysts to enrich their studies, and more. Catalog allows anyone to register and organize sources, encouraging open communication along the way.

Q9. How do you handle issues such as data protection, ownership, provenance and licensing in the Tamr catalog?

Gideon Goldin: Catalog allows users to indicate who owns what. Over the course of our Beta program, we have been fortunate enough to have over 800 early users of Catalog and have collected feedback about how our users would like to see data protection and provenance implemented in their own environments. We are eager to release new functionality to address these needs in the near future.

Q10. Do you plan to use the Tamr Catalog also for collecting data sets that can be used for data projects for the Common Good?

Gideon Goldin: We do know of a few instances of Catalog being used for such purposes, including projects that will build on the documenting of city and health data. In addition to our Catalog Beta Program, we are introducing a Community Developer Program, where we are eager to see how the community links Tamr Catalogs to new sources (including those in other catalogs), new analytics and visualizations, and ultimately insights. We believe in the power of open data at Tamr, and we’re excited to learn how we can help the Common Good.

—————————–
Gideon Goldin, UX Architect, Product Manager at Tamr.

Prior to Tamr, Gideon Goldin worked as a data visualization/UX consultant and university lecturer. He holds a Master’s in HCI and a PhD in cognitive science from Brown University, and is interested in designing novel human-machine experiences. You can reach Gideon on Twitter at @gideongoldin or email him at Gideon.Goldin at tamr.com.

Resources

– Download Free Tamr Catalog app.

– Tamr Catalog Developer Community
Online community where Tamr Catalog users can comment, interact directly with the development team, and learn more about the software, and where developers can explore extending the tool by creating new data connectors.

Gartner IT Glossary: Dark data

Related Posts

Data for the Common Good. Interview with Andrea Powell. ODBMS Industry Watch, June 9, 2015

Doubt and Verify: Data Science Power Tools By Michael L. Brodie, CSAIL, MIT

Data Wisdom for Data Science Bin Yu, Departments of Statistics and EECS, University of California at Berkeley

Follow ODBMS.org on Twitter: @odbmsorg

]]>
http://www.odbms.org/blog/2015/11/on-dark-data-interview-with-gideon-goldin/feed/ 0