On Big Data and NoSQL. Interview with Renat Khasanshyn.
“The most important thing is to focus on a task you need to solve instead of a technology” –Renat Khasanshyn.
I have interviewed Renat Khasanshyn, Founder and Chairman of Altoros.
Renat is a NoSQL and Big Data specialist.
Q1. In your opinion, what are the most popular players in the NoSQL market?
Khasanshyn: I think MongoDB is definitely one of the key players in the NoSQL market. This database has a long history, at least for this kind of product, and good commercial support. For many people, this database became the first mass-market NoSQL store. I can assume that MongoDB is going to become for NoSQL what MySQL is for relational databases. I would give the second position to Cassandra. It has a great architecture and enables building clusters with geographically dispersed nodes, which seems absolutely amazing to me. In addition, this database is often chosen by big companies that need a large, highly available cluster.
Q2. How do you evaluate and compare different NoSQL Databases?
Khasanshyn: Thank you for an interesting question. How do you choose a database? Which one is the best? These are the main questions for any company that wants to try a NoSQL solution. For some cases it may be quite easy to select a suitable NoSQL store, but very often it is not enough just to know the customer’s business goals. When we suggest a particular database, we take into consideration the following factors: the business issues a NoSQL store should solve, database read/write speed, availability, scalability, and many other important indicators. Sometimes we use a hybrid solution that may include several NoSQL databases.
Or we may see that a relational database is a good match for the case. The most important thing is to focus on the task you need to solve instead of on a technology.
We think that good scalability, performance, and ease of administration are the most important criteria for choosing a NoSQL database. These are the key factors we take into consideration. Of course, there are additional criteria that sometimes may be even more important than those mentioned above. To simplify the choice of a database for our engineers and for many other people, we have been carrying out independent tests for two years that evaluate the performance of different NoSQL databases. Although aimed at comparing performance, these investigations also touch on consistency, scalability, and configuration issues. You can take a look at our most recent article on this subject on Network World. New research on this subject is to be published in a couple of months.
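To give a flavor of what such performance tests measure, here is a toy throughput measurement in Python. It is only a sketch: the in-memory dict stands in for a real database client, and real benchmarks (for instance YCSB, which the industry commonly uses for NoSQL comparisons) exercise actual stores over the network with mixed read/write workloads.

```python
import time

# Toy stand-in for a database client; a real benchmark would target an actual store.
store = {}

def benchmark(n_ops=100_000):
    """Measure raw write throughput in operations per second."""
    start = time.perf_counter()
    for i in range(n_ops):
        store[f"key{i}"] = f"value{i}"  # the "write" being timed
    elapsed = time.perf_counter() - start
    return n_ops / elapsed

ops_per_sec = benchmark()
print(f"{ops_per_sec:,.0f} writes/sec")
```

A serious comparison also varies the read/write ratio, record size, and cluster size, which is why such studies report more than a single number.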
Q3. Which NoSQL databases have you evaluated so far, and what results did you obtain?
Khasanshyn: We have used a great variety of NoSQL databases, for instance, Cassandra, HBase, MongoDB, Riak, Couchbase Server, Redis, etc., in our research and real-life projects. From this experience, we learned that one should be very careful when choosing a database. It is better to spend more time on architecture design and make some changes to the project at the beginning than to run into a serious issue in the future.
Q4. For which projects did you use NoSQL databases, and for what kinds of problems?
Khasanshyn: It is hardly possible to name a project for which a NoSQL database would be useless, except for a blog or a home page. As the main use cases for NoSQL stores I would mention the following tasks:
● collecting and analyzing large volumes of data
● scaling large historical databases
● building interactive applications for which performance and fast response time to users’ actions are crucial
The major “drawback” of the NoSQL architecture is the absence of an ACID engine that verifies transactions. It means that financial operations or user registration are better handled by an RDBMS like Oracle or MS SQL Server. However, the absence of ACID allows for significant acceleration and decentralization of NoSQL databases, which are their major advantages. The bottom line: non-relational databases are much faster in general, and they pay for it with a fraction of their reliability. Is it a good tradeoff? It depends on the task.
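The ACID guarantee mentioned above can be sketched with a minimal example, using Python’s built-in sqlite3 as a stand-in for an RDBMS (the table and amounts are hypothetical): a money transfer either commits fully or rolls back, which is exactly what most NoSQL stores do not guarantee across multiple records.

```python
import sqlite3

# In-memory relational DB as a stand-in for Oracle/MS SQL (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates happen, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            cur = conn.execute("SELECT balance FROM accounts WHERE name = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 30)    # succeeds: balances become 70 / 80
transfer(conn, "alice", "bob", 1000)  # fails and rolls back: still 70 / 80
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

Without the transaction, a crash between the two UPDATE statements would leave money debited but never credited; this is the reliability that NoSQL stores traded away for speed.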
Q5. What do you see happening with NoSQL, going forward?
Khasanshyn: It’s quite difficult to make any predictions, but we expect that NoSQL and relational databases will become closer. For instance, NewSQL solutions took good scalability from NoSQL and a query language from SQL products.
Probably, a kind of standard query language based on SQL, or an SQL-like language, will soon appear for NoSQL stores. We are also looking forward to improved consistency, or to be more precise, better predictability of NoSQL databases. In other words, NoSQL solutions should become more mature. We will also see some market consolidation: smaller players will form alliances or quit the game, leaders will take a bigger part of the market, and we will most likely see a couple of acquisitions. Overall, it will become easier to work with NoSQL and to choose the right solution out of the available options.
Q6. What do you think is still needed for big data analytics to be really useful for the enterprise?
Khasanshyn: It is all about people and their skills. Storage is relatively cheap and available. The variety of databases is enormous, and it helps in solving virtually any task. Hadoop is stable. The majority of software is open source-based, or at least doesn’t cost a fortune. But all these components are useless without data scientists who can do modeling and simulations on the existing data, without engineers who can efficiently employ the toolset, and without managers who understand the outcomes of the data handling revolution that happened just recently. When we have these three types of people in place, then we can say that enterprises are fully equipped for gaining an edge in big data analytics.
Q7. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey. When you speak with your customers what are the typical use cases and requirements they have?
Khasanshyn: I agree with you. Some customers are just making their first steps with Hadoop, while others need to know how to optimize their Hadoop-based systems. Unfortunately, the second group of customers is much smaller. I can name the following typical tasks our customers have:
● To process historical data that has been collected over a long period of time. Some time ago, users were unable to process large volumes of unstructured data due to financial and technical limitations. Now Hadoop can do it at a moderate cost and in a reasonable time.
● To develop a system for data analysis based on Hadoop. Once an hour, the system builds patterns of typical user behavior on a Web site. These patterns help to react to users’ actions in real time, for instance, to allow some actions or temporarily block others because they are not typical of this user. The data is collected continuously and analyzed at the same time, so the system can rapidly respond to changes in user behavior.
● To optimize data storage. It is interesting that in some cases HDFS can replace a database, especially when the database was used for storing raw input data. Such projects do not need an additional database level.
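As a sketch of the first use case, processing historical data, Hadoop Streaming lets you express a job as plain stdin/stdout scripts in any language. The mapper and reducer below count events per user; the log format ("timestamp user action") and file names are assumptions for illustration, not a customer system.

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: emit (user, 1) for every hypothetical 'timestamp user action' line."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            yield parts[1], 1

def reduce_pairs(pairs):
    """Reducer: sum counts per user; Hadoop delivers pairs sorted by key."""
    for user, group in groupby(pairs, key=lambda kv: kv[0]):
        yield user, sum(count for _, count in group)

if __name__ == "__main__":
    # Run as mapper or reducer depending on argv, e.g. with Hadoop Streaming:
    #   hadoop jar hadoop-streaming.jar -mapper 'job.py map' -reducer 'job.py reduce' ...
    if sys.argv[1:] == ["map"]:
        for user, n in map_lines(sys.stdin):
            print(f"{user}\t{n}")
    else:
        pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for user, n in reduce_pairs((u, int(c)) for u, c in pairs):
            print(f"{user}\t{n}")
```

The same job runs unchanged whether the input is a day of logs or years of history; the cluster size, not the code, determines the turnaround time.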
I should say that our customers have similar requirements. Apart from solving a particular business task, they need a certain level of performance and data consistency.
Q8. In your opinion is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?
Khasanshyn: In a few words, my answer is yes. Some specific features of Hadoop make it possible to prepare data for future analysis quickly and at a moderate cost. In addition, this technology can work with unstructured data. However, I do not think it will happen very soon. There are many OLAP systems, and they solve their tasks, some better than others. In general, enterprises are very reluctant to change anything, and replacing the OLAP tools requires additional investments. The good news is that we don’t have to choose one or the other: Hadoop can be used as a pre-processor of the data for OLAP analysis, and analysts can keep working with familiar tools.
Q9. How do you categorize the various stages of Hadoop usage in enterprises?
Khasanshyn: I would name the following stages of Hadoop usage:
1. Development of prototypes to check out whether Hadoop is suitable for their tasks
2. Using Hadoop in combination with other tools for storing and processing data of some business units
3. Implementation of a centralized enterprise data storage system and gradual integration of all business units into it
Q10. Data Warehouse vs Big “Data Lake”. What are the similarities and what are the differences?
Khasanshyn: Even though the Big “Data Lake” is a metaphor, I do not really like it.
This name suggests something limited, isolated from other systems. I would rather call this concept a “Big Data Ocean,” because the data added to the system can interact with the surrounding systems. In my opinion, data warehouses are a bit outdated. At an earlier stage, such systems enabled companies to aggregate a lot of data in one central storage and arrange this data, all at an acceptable speed. Now there are a lot of cheap storage solutions, so we can keep enormous volumes of data and process them much faster than with data warehouses.
The possibility to store large volumes of data is a feature common to data warehouses and a Big “Data Lake.” With a flexible architecture and broad capabilities for data analysis and discovery, a Big “Data Lake” provides a wider range of business opportunities. A modern company should adjust to changes very fast; the structure that was good yesterday may become a limitation today.
Q11. In your opinion, is there a technology which is best suited to build a Big Data Analytics Data Platform? If yes, which one?
Khasanshyn: As I have already said, there is no “magic bullet” that can cure every disease. There is no universal Big Data analytics platform that fits every case; everything depends on the requirements of a particular business. A system of this kind should have the following features:
● A Hadoop-based system for storing and processing data
● An operational database that contains the most recent data to be analyzed (it can be raw data). A NoSQL solution can be used in this case.
● A database for storing predicted indicators. Most probably, it should be a relational data store.
● A system that allows for creating data analysis algorithms. The R language can be used to complete this task.
● A report building system that provides access to the data. For instance, there are such good options as Tableau or Pentaho.
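The pieces above could be wired together roughly as follows. This is a minimal sketch only: the metric name, the window size, and the naive moving-average "model" are all hypothetical, and sqlite3 stands in for the relational store of predicted indicators that the reporting layer would query.

```python
import sqlite3
from collections import deque

# Hypothetical operational data: hourly request counts from the NoSQL layer.
operational = [("2013-06-01T10", 120), ("2013-06-01T11", 140),
               ("2013-06-01T12", 160), ("2013-06-01T13", 180)]

def predict_next(values, window=3):
    """Naive forecast: the mean of the last `window` observations."""
    recent = deque(values, maxlen=window)
    return sum(recent) / len(recent)

forecast = predict_next([v for _, v in operational])

# Store the predicted indicator in a relational DB for the reporting layer.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE predictions (metric TEXT, value REAL)")
db.execute("INSERT INTO predictions VALUES (?, ?)",
           ("requests_next_hour", forecast))
row = db.execute("SELECT value FROM predictions").fetchone()
print(row[0])
```

In a real platform, the modeling step would be an R or similar analysis job and the prediction table would feed a tool like Tableau or Pentaho, but the data flow is the same: operational store, analysis, relational store, report.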
Q12. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?
Khasanshyn: In my opinion, cloud computing became the force that raised the Big Data wave. Elastic computing enabled us to use exactly the amount of computing resources we need and reduced the cost of maintaining a large infrastructure.
There is a connection between elastic computing and big data analytics. It is quite a typical case for us to process data periodically, for instance, once a day. To solve this task, we can deploy a new cluster or scale up an existing Hadoop cluster in the cloud environment. We can temporarily increase the speed of data processing by scaling a Hadoop cluster: the task is executed faster, and after that we can stop the cluster or reduce its size. I can even say that cloud technologies are a must-have component for a Big Data analysis system.
Renat Khasanshyn, Founder and Chairman, Altoros.
Renat is founder & CEO of Altoros and a Venture Partner at Runa Capital. Renat helps define Altoros’s strategic vision and its role in the Big Data, Cloud Computing, and PaaS ecosystems. Renat is a frequent conference and meetup speaker on these topics.
Under his supervision Altoros has been servicing such innovative companies as Couchbase, RightScale, Canonical, DataStax, Joyent, Nephoscale, and NuoDB.
In the past, Renat was selected as a finalist for the Emerging Executive of the Year award by the Massachusetts Technology Leadership Council and once won the IBM Business Mashup Challenge. Prior to founding Altoros, Renat was VP of Engineering for PriMed, a Tampa-based insurance company. Renat is also the founder of Apatar, an open source data integration toolset, founder of the Silicon Valley NewSQL User Group, and co-founder of the Belarusian Java User Group.