ODBMS Industry Watch: Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation (http://www.odbms.org/blog)

On Big Data and Data Science. Interview with James Kobielus
http://www.odbms.org/blog/2016/04/on-big-data-and-data-science-interview-with-james-kobielus/ (19 April 2016)

“One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.”– James Kobielus

On the topics of Big Data, and Data Science, I have interviewed James Kobielus, IBM Big Data Evangelist.

RVZ

Q1. What kind of companies generate Big Data, besides the Internet giants?

James Kobielus: Big data isn’t something you “generate.” Rather, the term refers to the ability to achieve differentiated value from advanced analytics on trustworthy data at any scale. In other words, it’s a best practice, not a specific type of data or even a specific scale of data (measured in volume, velocity, and/or variety).

When considered in this light, you can identify big data analytic applications in every industry. Every C-level executive has strategic applications of big data. Here is just a smattering:

  • Chief Marketing Officers have been the prime movers on many big data initiatives that involve Hadoop, NoSQL, and other approaches. Their primary applications consist of marketing campaign optimization, customer churn and loyalty, upsell and cross-sell analysis, targeted offers, behavioral targeting, social media monitoring, sentiment analysis, brand monitoring, influencer analysis, customer experience optimization, content optimization, and placement optimization.
  • Chief Information Officers use big data platforms for data discovery, data integration, business analytics, advanced analytics, and exploratory data science.
  • Chief Operations Officers rely on big data for supply chain optimization, defect tracking, sensor monitoring, and smart grid, among other applications.
  • Chief Information Security Officers run security incident and event management, anti-fraud detection, and other sensitive applications on big data.
  • Chief Technology Officers do IT log analysis, event analytics, network analytics, and other systems monitoring, troubleshooting, and optimization applications on big data.
  • Chief Financial Officers run complex financial risk analysis and mitigation modeling exercises on big data platforms.

Q2. What are the most challenging problems you are facing when analysing Big Data?

James Kobielus: Searching for actionable intelligence in big data involves building and testing advanced-analytics models against large volumes of complex data that may be flowing in at high velocities.

At these scales, it’s easy to get overwhelmed in your analysis unless you automate the end-to-end processes of extracting intelligence at scale. Automation can also help control the cost of managing a growing volume of algorithmic models against ever expanding big-data collections. The key processes that need automating are data discovery, profiling, sampling, and preparation, as well as model building, scoring, and deployment.
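
As a rough illustration of what automating the preparation, model-building and scoring steps can look like, here is a minimal sketch using the open-source scikit-learn library; the file name, column names and parameter grid are hypothetical, and the example is not tied to any particular vendor product.

import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical churn data set with numeric feature columns and a 0/1 label.
df = pd.read_csv("customer_events.csv")
X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Preparation and model building chained into one automated unit.
pipeline = Pipeline([
    ("prepare", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Automated model selection over a small hyperparameter grid.
search = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Scoring on held-out data before deployment.
print(search.best_params_, search.score(X_test, y_test))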

Q3. How do you typically handle them?

James Kobielus: Automating the modeling process will boost data scientist productivity by an order of magnitude, freeing them from drudgery so that they can focus on the sorts of exploration, modeling, and visualization challenges that demand expert human judgment. Data scientists can accelerate their modeling automation initiatives by following these steps:

  • Virtualize access to data, metadata, rules, and predictive models, as well as to data integration, data warehousing, and advanced analytic applications through a BI semantic virtualization layer;
  • Unify access, governance, orchestration, automation, and administration across these resources within a service-oriented architecture;
  • Explore commercial tools that support maximum automation of model development, scoring, deployment, and execution;
  • Consolidate, accelerate, and deepen predictive analytics through integration into big-data platforms with scalable in-database execution; and
  • Migrate existing analytical data marts into multidomain big-data platforms with unified data, metadata, and model governance within a service-oriented virtualization framework.

Q4. What are in your experience the typical mistakes made in large scale data projects?

James Kobielus: One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.

Even if you accept that a data scientist’s integrity is rock-solid, intentions pure, skills stellar, and discipline rigorous, there’s no denying that bias may creep inadvertently into their work with big data. The biases may be minor or major, episodic or systematic, tangential or material to their findings and recommendations. Whatever their nature, the biases must be understood and corrected as fully as possible.

Here are some of the key sources of bias that may crop up in a data scientist’s work with big data:

  • Cognitive bias: This is the tendency to make skewed decisions based on pre-existing cognitive and heuristic factors–such as a misunderstanding of probabilities–rather than on the data and other hard evidence. You might say that the educated intuition that drives data science is rife with cognitive bias, but that’s not always a bad thing.
  • Selection bias: This is the tendency to skew your choice of data sources to those that may be most available, convenient, and cost-effective for your purposes, as opposed to being necessarily the most valid and relevant for your study. Clearly, data scientists do not have unlimited budgets, may operate under tight deadlines, and don’t use data for which they lack authorization. These constraints may introduce an unconscious bias in the big-data collections they are able to assemble.
  • Sampling bias: This is the tendency to skew the sampling of data sets toward subgroups of the population most relevant to the initial scope of a data-science project, thereby making it unlikely that you will uncover any meaningful correlations that may apply to other segments. Another source of sampling bias is “data dredging,” in which the data scientist uses regression techniques that may find correlations in samples but that may not be statistically significant in the wider population. Consequently, you’re likely to spuriously confirm your initial model for the segments that happen to make the sampling cut. (A short synthetic simulation after this list illustrates the effect.)
  • Modeling bias: Beyond the biases just discussed, this is the tendency to skew data-science models by starting with a biased set of project assumptions that drive selection of the wrong variables, the wrong data, the wrong algorithms, and the wrong metrics of fitness. In addition, overfitting of models to past data without regard for predictive lift is a common bias. Likewise, failure to score and iterate models in a timely fashion with fresh observational data also introduces model decay, hence bias.
  • Funding bias: This may be the most silent but pernicious bias in data-scientific studies of all sorts. It’s the unconscious tendency to skew all modeling assumptions, interpretations, data, and applications to favor the interests of the party–employer, customer, sponsor, etc.–that employs or otherwise financially supports the data-science initiative. Funding bias makes it highly unlikely that data scientists will uncover disruptive insights that will “break the rice bowl” in which they make their living.
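
As a concrete, self-contained illustration of the sampling-bias point above, the following sketch uses purely synthetic data (no real study is implied): a relationship that is strong in one convenient segment looks quite different once the whole population is considered.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Two population segments; only segment 0 has a real x-y relationship.
segment = rng.integers(0, 2, size=n)
x = rng.normal(size=n)
y = np.where(segment == 0, 2.0 * x, 0.0) + rng.normal(size=n)

population_corr = np.corrcoef(x, y)[0, 1]
segment_only_corr = np.corrcoef(x[segment == 0], y[segment == 0])[0, 1]

# Sampling only the convenient segment overstates the population-wide effect.
print(f"whole population: {population_corr:.2f}")   # roughly 0.6
print(f"segment 0 only:   {segment_only_corr:.2f}") # roughly 0.9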

Q5. How do you measure “success” when analysing data?

James Kobielus: You measure success in your ability to distill useful insights in a timely fashion from the data at your disposal.

Q6. What skills are required to be an effective Data Scientist?

James Kobielus: Data science’s learning curve is formidable. To a great degree, you will need a degree, or something substantially like it, to prove you’re committed to this career. You will need to submit yourself to a structured curriculum to certify you’ve spent the time, money and midnight oil necessary for mastering this demanding discipline.

Sure, there are run-of-the-mill degrees in data-science-related fields, and then there are uppercase, boldface, bragging-rights “DEGREES.” To some extent, it matters whether you get that old data-science sheepskin from a traditional university vs. an online school vs. a vendor-sponsored learning program. And it matters whether you only logged a year in the classroom vs. sacrificed a considerable portion of your life reaching for the golden ring of a Ph.D. And it certainly matters whether you simply skimmed the surface of old-school data science vs. pursued a deep specialization in a leading-edge advanced analytic discipline.

But what matters most to modern business isn’t that every data scientist has a big honking doctorate. What matters most is that a substantial body of personnel has a common grounding in core curriculum of skills, tools and approaches. Ideally, you want to build a team where diverse specialists with a shared foundation can collaborate productively.

Big data initiatives thrive if all data scientists have been trained and certified on a curriculum with the following foundation:

  • Paradigms and practices: Every data scientist should acquire a grounding in core concepts of data science, analytics and data management. They should gain a common understanding of the data science lifecycle, as well as the typical roles and responsibilities of data scientists in every phase. They should be instructed on the various role(s) of data scientists and how they work in teams and in conjunction with business domain experts and stakeholders. And they should learn a standard approach for establishing, managing and operationalizing data science projects in the business.
  • Algorithms and modeling: Every data scientist should obtain a core understanding of linear algebra, basic statistics, linear and logistic regression, data mining, predictive modeling, cluster analysis, association rules, market basket analysis, decision trees, time-series analysis, forecasting, machine learning, Bayesian and Monte Carlo statistics, matrix operations, sampling, text analytics, summarization, classification, principal components analysis, experimental design, unsupervised learning, and constrained optimization.
  • Tools and platforms: Every data scientist should master a core group of modeling, development and visualization tools used on your data science projects, as well as the platforms used for storage, execution, integration and governance of big data in your organization. Depending on your environment, and the extent to which data scientists work with both structured and unstructured data, this may involve some combination of data warehousing, Hadoop, stream computing, NoSQL and other platforms. It will probably also entail providing instruction in MapReduce, R and other new open-source development languages, in addition to SPSS, SAS and any other established tools.
  • Applications and outcomes: Every data scientist should learn the chief business applications of data science in your organization, as well as how to work best with subject-domain experts. In many companies, data science focuses on marketing, customer service, next best offer, and other customer-centric applications. Often, these applications require that data scientists understand how to leverage customer data acquired from structured survey tools, sentiment analysis software, social media monitoring tools and other sources. It is also essential that every data scientist gain an understanding of the key business outcomes–such as maximizing customer lifetime value–that should focus their modeling initiatives.

Classroom instruction is important, but a curriculum that is 100 percent devoted to reading books, taking tests and sitting through lectures is insufficient. Hands-on laboratory work is paramount for a truly well-rounded data scientist. Make sure that your data scientists acquire certifications and degrees that reflect hands-on development of statistical models that use real data and address substantive business issues.

A business-oriented data-science curriculum should produce expert developers of statistical and predictive models. It should not degenerate into a program that produces analytics geeks with heads stuffed with theory but whose diplomas are only fit for hanging on the wall.

Q7. Hadoop vs. Spark: what are the pros and cons?

James Kobielus: Big data analytics infrastructures are growing more hybridized than ever. Every new technology—such as Hadoop, in-memory databases, and graph databases—finds its specific niche in terms of use cases, deployment modes, and applications for which it is best suited.

Even as Apache Spark pushes more deeply into big-data environments, it won’t substantially change this trend. Yes, of course Spark is on the fast track to ubiquity in big-data analytics. This is especially true for the next generation of machine-learning applications that feed on growing in-memory pools and require low-latency distributed computations for streaming and graph analytics. But those use cases aren’t the sum total of big-data analytics and never will be.

As we all grow more infatuated with Spark, it’s important to continually remind ourselves of what it’s not suitable for. If, for example, one considers all the critical data management, integration, and preparation tasks that must be performed prior to modeling in Spark, it’s clear that these will not be executed in any of the Spark engines (Spark SQL, Spark Streaming, GraphX). Instead, they’ll be carried out in the data platforms and elastic clusters (HDFS, Cassandra, HBase, Mesos, cloud services, etc.) upon which those engines run. Likewise, you’d be hard-pressed to find anyone who’s seriously considering Spark in isolation for data warehousing, data governance, master data management, or operational business intelligence.
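
As a hedged sketch of that division of labour, the following minimal PySpark example assumes the heavy data preparation has already happened upstream and that its curated output sits on HDFS; the path and column names are invented for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("modeling-sketch").getOrCreate()

# Data management, integration and preparation were done on the underlying
# platform; Spark only reads the curated result for modeling.
df = spark.read.parquet("hdfs:///curated/customer_features")

assembled = VectorAssembler(
    inputCols=["tenure", "monthly_spend", "support_calls"],
    outputCol="features",
).transform(df)

# In-memory, distributed model training is the part Spark is actually for.
model = LogisticRegression(featuresCol="features", labelCol="churned").fit(assembled)
print(model.summary.areaUnderROC)

spark.stop()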

Above all else, Spark is the new power tool for data scientists who are pushing boundaries in the emerging era of in-memory big data analytics in low-latency scenarios of all types. Spark is proving its value as a development tool for the new generation of data scientists building the in-memory statistical models upon which it all will depend.

Let’s not fall into the delusion that everything is converging toward Spark, as if it were the ravenous maw that will devour every other big-data analytics tool and platform. Spark is just another approach that’s being fitted to and optimized for specific purposes.

And let’s resist the hype that treats Spark as Hadoop’s “successor.” This implies that Hadoop and other big-data approaches are “legacy,” rather than what they are, which is foundational. For example, no one is seriously considering doing “data lakes,” “data reservoirs,” or “data refineries” on anything but Hadoop or NoSQL.

——————–

James Kobielus is an industry veteran and serves as IBM Big Data Evangelist; Senior Program Director for Product Marketing in Big Data Analytics; and Team Lead, Technical Marketing, IBM Big Data & Analytics Hub. He spearheads thought leadership activities across the IBM Analytics solution portfolio. He has spoken at such leading industry events as IBM Insight, Hadoop Summit, and Strata. He has published several business technology books and is a very popular provider of original commentary on blogs and social media.

Resources

– Master of Information and Data Science, UC Berkeley School of Information.

– MS in Data Science, NYU Center for Data Science.

– Free data science curriculum, kdnuggets.com

– Data Science, Coursera

– Master of Science in Data Science, Data Science Institute

– Data Mining and Applications Graduate Certificate, Stanford

– The European Data Science Academy (EDSA) designs curricula for data science training and data science education across the European Union (EU).

– The EDISON project will focus on activities to establish the new profession of ‘Data Scientist’, following the emergence of Data Science technologies (also referred to as Data Intensive or Big Data technologies), which change the way research is done, how scientists think and how research data are used and shared. This includes definition of the required skills, a competence framework/profile, a corresponding Body of Knowledge and a model curriculum. It will develop a sustainability/business model to ensure a sustainable increase in the number of Data Scientists graduated from universities and trained by other professional education and training institutions in Europe.
EDISON will facilitate the establishment of a Data Science education and training infrastructure at major European universities by promoting the experience of ‘champion’ universities, involving them in coordinated development and implementation of the model curriculum and the creation of a cooperative educational and training infrastructure.

Related Posts

– RIP Big Data, By Carl Olofson, Research Vice President, Data Management Software Research, IDC. ODBMS.org, January 2016

– Open Source Software and IBM’s Big Data platform. By Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory. ODBMS.org, April 2016.

– Looking back at Big Data in 2015, By Cynthia M. Saracco, IBM Senior Solution Architect. ODBMS.org, November 2015

– Heuristics for a Data Scientist: A common sense approach. By Silvia Dassiè, Data Scientist at Ryanair. ODBMS.org, December 2015

– The rise of immutable data stores. By Alan Morrison, Senior Manager, PwC Center for Technology and Innovation. ODBMS.org, October 2015

Follow us on Twitter: @odbmsorg

##

Predictive Analytics in Healthcare. Interview with Steve Nathan
http://www.odbms.org/blog/2014/08/steve-nathan-ceo-amara-health-analytics/ (26 August 2014)

“Analysis of big data can identify the subtle differences that explain why similar-seeming patients have different outcomes, and predictive decision support can help physicians guide more patients on the path to recovery”–Steve Nathan.

Why use predictive analytics in healthcare? On this topic, I have interviewed Steve Nathan, CEO at Amara Health Analytics.

RVZ

Q1. Amara provides real-time predictive analytics to support clinicians in the early detection of critical disease states. Can you tell us a bit more of what are these predictive analytics and which data sets do you use?

Steve Nathan: Our system runs on hospital data, including labs, pharmacy, real-time vitals, ADT, EMR, and clinical narrative text. We have developed domain-specific natural language processing (NLP) and machine learning to find predictive signal that is beyond the reach of traditional approaches to clinical analytics.

Q2. How large are the data sets you analyse?

Steve Nathan: For an average 500 bed hospital we analyse about 100 million data items in real-time annually. We expect this volume of data to increase dramatically as more centers deploy hospital-wide automated real-time streaming of data from patient monitors into the EMR.
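
For a rough sense of scale, 100 million items a year averages out to about three data points per second (100,000,000 ÷ 31,536,000 seconds ≈ 3.2), although real-time vitals streaming would presumably arrive in much spikier bursts than that average suggests.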

Q3. Why is early detection important?

Steve Nathan: Let’s look at the example of sepsis, which is the body’s systemic toxic response to an infection. Clinicians know how to treat sepsis once identified, but it’s often difficult to identify early because it can mimic other conditions and there are a multitude of clinical variables involved, some of which are subjective in nature.
At the later stages of sepsis, mortality increases almost 8% with every hour of delayed treatment. So technology that can assist clinicians in early identification is important.

Q4. What is your back-end platform for real-time data processing and analytics?

Steve Nathan: The major components of the backend are Mirth (open source HL7 parsing/transform), a data aggregator, a real-time data streaming engine, Jess rules engine, the core analytics engine (including NLP pipeline), and Cassandra NoSQL database from DataStax.
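
To make the time-series storage part of that stack more concrete, here is a minimal, hedged sketch using the open-source DataStax Python driver for Cassandra; the keyspace, table and column names are invented for illustration and are not Amara’s actual schema.

from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS clinical
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Partitioning by (patient_id, day) keeps each day's vitals in one partition,
# a common Cassandra pattern for high-volume time-series data.
session.execute("""
    CREATE TABLE IF NOT EXISTS clinical.vitals (
        patient_id text,
        day        text,
        ts         timestamp,
        metric     text,
        value      double,
        PRIMARY KEY ((patient_id, day), ts, metric)
    ) WITH CLUSTERING ORDER BY (ts DESC, metric ASC)
""")

now = datetime.now(timezone.utc)
session.execute(
    "INSERT INTO clinical.vitals (patient_id, day, ts, metric, value) "
    "VALUES (%s, %s, %s, %s, %s)",
    ("patient-42", now.strftime("%Y-%m-%d"), now, "heart_rate", 92.0),
)

cluster.shutdown()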

Q5 What are the main technical challenges you encounter when you analyse big data sets that are composed of previous patient histories and medical records?

Steve Nathan: Input data comes from a wide variety of centers and is wildly heterogeneous. Issues range from proprietary health IT systems and incompatible usage of available standards like HL7, to wide variations in clinical language used by physicians and nurses in their notes, to varying patterns of diagnostic testing by different providers. And of course every patient is unique. So much of our system is dedicated to producing a standardized timeline for each patient in which every variable has a consistent interpretation, no matter where it originally came from. We then use these patient timelines as input for machine learning, and there are some unique challenges there in developing predictive models that are appropriate for use in real-time decision support.

Q6. Why did you choose a NoSQL database for this task?

Steve Nathan: The primary motivation was flexibility of data representation. We wanted to be able to change schemas dynamically. Also, DataStax’s integration of Cassandra with Solr was very important for us because we do a lot of text mining as part of our overall approach.

Q7. What are the lessons learned so far?

Steve Nathan: Processing huge amounts of historical patient data in simulations to refine predictive models is very challenging to do with good performance. We must be able to run many years’ worth of archived data through the system in a small number of hours, so the system architecture must be designed with this processing mode in mind.
It is important to consider early on how to upgrade the live system with no down time while avoiding the possibility of losing any data.

Q8. What about data protection issues?

Steve Nathan: We comply with HIPAA regulations and go to great lengths to protect data. This includes work processes and policies for all employees, contractors, and business associates. And, importantly, it includes the architecture and deployment of our systems – for example all data is transferred via VPN, and every VPN connection is terminated in a VM that contains a single protected dataset, rather than terminating at a DC router.

Q9. Which relationships exist between Amara and UC San Diego?

Steve Nathan: There is no formal relationship between Amara and UCSD. Amara’s co-founder and chairman, Dr. Ramamohan Paturi, is a professor of Computer Science at UCSD.

Qx. Anything else you wish to add?

Steve Nathan: Analysis of big data can identify the subtle differences that explain why similar-seeming patients have different outcomes, and predictive decision support can help physicians guide more patients on the path to recovery.

—————————————-
Steve Nathan, CEO, Amara Health Analytics

Steve Nathan has over 25 years in the enterprise software industry. As CEO of Parity Computing beginning in 2008, Steve led the company’s product expansion with a pioneering analytics platform for enhancing biomedical research productivity. He then led the formation of Amara Health Analytics, the concept and strategy for the Clinical Vigilance™ product line, and the spinout of Amara as an independent company.

Steve has held key leadership positions at recognized technology innovators Sun Microsystems and Cray Research, as well as at start-ups Celerity Computing, Alignent Software, and Exist Global. As General Manager at Sun Microsystems, he had P&L responsibility for messaging, portal, and web infrastructure in the iPlanet business unit, where he grew the annual revenue of this product line from $15M to $150M in three years.

Steve holds B.S. degrees from the University of California at Riverside in Computer Science, Mathematics, and Psychology; and he is a 1999 graduate of the Stanford University Business Executive Program.

Resources

– U.S. Department of Health & Human Services

– DataStax Enterprise Reference Architecture, White Paper, DataStax Corporation, January 2014

Related Posts

– Big Data: three questions to DataStax. ODBMS Industry Watch, April 7, 2014

Follow ODBMS.org and ODBMS Industry Watch on Twitter:
@odbmsorg

##

NoSQL for the Internet of Things. Interview with Mike Williams
http://www.odbms.org/blog/2014/06/mike-williams/ (5 June 2014)

“The Internet of Things is a good fit for NoSQL technologies, as you face the challenge of dealing with huge volumes of data over time. For businesses that wish to scale their IoT implementations and make use of the data that these networks create, NoSQL solutions are a better fit than RDBMS options.”–Mike Williams

I have interviewed Mike Williams, software director for i2O Water. Mike has an interesting use case for NoSQL in the area of water distribution networks. We also discussed how NoSQL can be used for the Internet of Things.
RVZ

Q1. What is the business of i2O Water?

Mike Williams: i2O is the world’s leading developer of Smart Pressure Management solutions for water distribution networks.

Q2. What are the main benefits for utility companies? 

Mike Williams: i2O’s Smart Pressure Management solutions optimise the performance of water distribution networks through improving network visibility, and enabling the remote control and automatic optimisation of network pressures.
These technology-enabled best practices deliver benefits in six key areas, with customers typically achieving return on investment in 6-18 months.
The opportunities for savings fall into two areas: on the network side, we see leakage reduction, energy savings and a big reduction in burst pipes.
For our utility customers, there are business-level returns as well based on improved customer service and operational cost savings. We also see customers being able to extend the life of their assets across the network as well, so they see a real long-term benefit to being able to control water pressure more accurately.

Q3. Do you have any metrics to share with us on the estimated volume of water saved by using your software? 

Mike Williams: The best metric we can give is the simple volume of water that we help customers save: we currently help our customers to save over 235 Million Litres of water every day.

Q4. How can big data be used to reduce water leakage? 

Mike Williams: The i2O system monitors and controls water pressure throughout a zone or network. This enables water companies to fully optimise water pressures remotely and automatically to meet agreed customer service levels throughout the network. The Big Data is all of this time-series data of pressures and flows (and other metrics) for lots of locations over years’ worth of points.
The solution continuously learns key characteristics within a Zone and then automatically controls the pressure within the Zone to achieve a stable target pressure at the critical point. This is achieved through a sophisticated mathematical algorithm, which automatically generates a control model. The control model is supplied ‘over the air’ from i2O software to the Pressure Reducing Valve (PRV) controller.
The control model is automatically updated if the software detects a significant change in the head loss characteristics. This ensures that no more pressure than is required enters the network on an ongoing basis. The i2O Automatic Optimisation solution is the world’s first – and most widely deployed – system for automatically optimising and remotely controlling water pressure in your network.
i2O’s Automatic PRV Optimisation has delivered significant results in hundreds of zones for major water companies worldwide.

Q5. Could you please describe your IT back-end infrastructure? What are  the main challenges you have? 

Mike Williams: We have an eco-system of distributed loosely coupled Services, each with access to numerous dedicated and tailored data stores.
These services collaborate through an Event-Driven Architecture (EDA) and a distributed Event Broker to deliver business services to our customers.
The main challenges are around the scalability of these services and the data stores where the vast amount of time-series data are held.
Due to the way we have architected our data, we have the ability to replay these events over history when we make changes to our products and services – we can use this historical data to analyse how devices on our networks respond to changes in circumstances, and measure what difference our new features would provide.

Q6. Why did you make the shift to NoSQL? What were you doing previously? 

Mike Williams: The challenge was the overall scale–up of the whole platform. From both a technical and a business stand-point, we needed to scale up for the next years. Based on the number of devices that our customers had in place, and the number of new customers that we were projected to win, our RDBMS was not able to cope by design. Storing time-series data is a specialist need that column-oriented data stores are better suited to than traditional RDBMS row-oriented technologies. We were (and for some customers, still are) using MS SQL Server as the only data store. NoSQL technologies also better fit our data modelling needs such as searching, where we employ the ElasticSearch NoSQL solution in a clustered manner.
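
As a hedged sketch of why that column-oriented layout suits the workload, the read below pulls a few hours of pressure readings for one logger as a single contiguous partition slice; the keyspace, table and column names are invented and are not i2O’s actual schema.

from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("telemetry")   # assumes this keyspace already exists

start = datetime(2014, 6, 1, 0, 0, tzinfo=timezone.utc)
end = datetime(2014, 6, 1, 6, 0, tzinfo=timezone.utc)

# With a (device_id, day) partition key and ts as the clustering column, a
# time range for one device is a single sequential read inside one partition.
rows = session.execute(
    "SELECT ts, pressure_bar FROM readings "
    "WHERE device_id = %s AND day = %s AND ts >= %s AND ts < %s",
    ("prv-0017", "2014-06-01", start, end),
)
for row in rows:
    print(row.ts, row.pressure_bar)

cluster.shutdown()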

Q7. What is your experience of using a NoSQL database so far? 

Mike Williams:  So far it has been very positive. We had a learning curve to go up and the technologies themselves (Cassandra and ElasticSearch) have matured greatly in the last 18 months that we have been using them.

Q8. What does the future hold for Cassandra in your organisation? How are  you using other database types as well? 

Mike Williams: We moved over to using Cassandra as our main data store for time-series data – this was because it provides better support for columnar data, as well as meeting the requirement that we had around scale. We are committed to Cassandra for the foreseeable future and so its future is to grow alongside our business. We also use ElasticSearch as mentioned, as well as PostgreSQL when our specific data models dictate the use of tabular, relational data.

Q9. Do you work on the cross-over with the Internet of Things? 

Mike Williams:  Indeed we do as we produce intelligent devices that communicate with our Platform over the Internet (GPRS mobile network). The devices automatically sample for pressure, temperature and other information that can then be compared across the network.

The Internet of Things is a good fit for NoSQL technologies, as you face  the challenge of dealing with huge volumes of data over time. For businesses that wish to scale their IoT implementations and make use of the data that these networks create, NoSQL solutions are a better fit than RDBMS options.
The ability to capture information from across our network has two key value propositions: the first is for our customers right now, as they can manage their water pressure more effectively. The second is the long term value that the data can provide. By being able to model and re-use historical data, we can offer much more value to customers than they can achieve by themselves. We can add new features to our platform, and demonstrate how these new features can provide greater opportunities to save money and water for customers.

———————

Mike Williams is the software director for i2O Water. He has 25 years of experience working for innovative high-tech companies in finance, payments, process engineering and the environment, where he has focused on solving problems and translating business challenges into tangible technical solutions. Before joining i2O Water, Mike was the Chief Software Architect and head of development for Bottomline Technologies, the leading supplier of banking transaction and payments software.
Mike is also an Agile Coach and has helped transform numerous businesses as they become Agile in their approaches to business as a whole and not just software development. Mike is the organiser and founder of the Agile South Coast group.

Related Posts

Big Data and NoSQL: Interview with Joe Celko. ODBMS Industry Watch, February 20, 2014.

Resources
– ODBMS.org: NoSQL Data Stores. Free downloads:
  • Free software
  • Articles, papers and presentations
  • Documentation, tutorials and lecture notes
  • PhD and Master’s theses

Follow ODBMS.org on Twitter: @odbmsorg

##

On SQL and NoSQL. Interview with Dave Rosenthal
http://www.odbms.org/blog/2014/03/dave-rosenthal/ (18 March 2014)

“Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications.”–Dave Rosenthal.

On SQL and NoSQL, I have interviewed Dave Rosenthal CEO of FoundationDB.

RVZ

Q1. What are the suggested criteria for users when they need to trade durability for lower latency, higher throughput and write availability?

Dave Rosenthal: There is a tradeoff between commit latency and durability–especially in distributed databases. At one extreme a database client can just report success immediately (without even talking to the database server) and buffer the writes in the background. Obviously, that hides latency well, but you could lose a suffix of transactions. At the other extreme, you can replicate writes across multiple machines, fsync them on each of the machines, and only then report success to the client.

FoundationDB is optimized to provide good performance in its default setting, which is the safest end of that tradeoff.

Usually, if you want some reasonable amount of durability guarantee, you are talking about a commit latency of a small constant factor times the network latency. So, the real latency issues come with databases spanning multiple data centers. In that case FoundationDB users are able to choose whether or not they want durability guarantees in all data centers before commit (increasing commit latencies), which is our default setting, or whether they would like to relax durability guarantees by returning a commit when the data is fsync’d to disk in just one datacenter.

All that said, in general, we think that the application is usually a more appropriate place to try to hide latency than the database.

Q2. Justin Sheehy of Basho in an interview said [1] “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”. What is your opinion on this?

Dave Rosenthal: Yes, we totally agree with Justin. Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications. In fact, true ACID transactions are way more broadly important than that. They give you the ability to build abstractions and systems that you can provide guarantees about.
As Michael Cahill says in his thesis which became the SIGMOD paper of the year: “Serializable isolation enables the development of a complex system by composing modules, after verifying that each module maintains consistency in isolation.” It’s this incredibly important ability to compose that makes a system with transactions special.

Q3. FoundationDB claim to provide full ACID transactions. How do you do that?

Dave Rosenthal: In the same basic way as many other transactional databases do. We use a few strategies that tend to work well in distributed systems such as optimistic concurrency and MVCC. We also, of course, have had to solve some of the fundamental challenges associated with distributed systems and all of the crazy things that can happen in them. Honestly, it’s not very hard to build a distributed transactional database. The hard part is making it work gracefully through failure scenarios and making it run fast.
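
For flavour, here is a minimal, hedged sketch of what an ACID read-modify-write can look like through FoundationDB’s public Python binding; the key layout is made up, error handling is omitted, and the exact API version number depends on the client release installed.

import fdb

fdb.api_version(300)   # must match the API version supported by the installed client
db = fdb.open()        # uses the default cluster file

@fdb.transactional
def complete_job(tr, job_id):
    # The read, the write and the delete below commit atomically or not at
    # all; on a conflict with a concurrent writer the decorator retries the
    # whole function, which is how optimistic concurrency surfaces to the user.
    payload = tr[b"jobs/pending/" + job_id]
    if payload.present():
        tr[b"jobs/done/" + job_id] = bytes(payload)
        del tr[b"jobs/pending/" + job_id]

complete_job(db, b"42")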

Q4. Is this similar to Oracle NoSQL?

Dave Rosenthal: Not really. Both Oracle NoSQL and FoundationDB provide an automatically-partitioned key-value store with fault tolerance. Both also have a concept of ordering keys (for efficient range operations) though Oracle NoSQL only provides ordering “within a Major Key set”. So, those are the similarities, but there are a bunch of other NoSQL systems with all those properties. The huge difference is that FoundationDB provides for ACID transactions over arbitrary keys and ranges, while Oracle NoSQL does not.

Q5. How would you compare your product offering with respect to NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak, and NewSQL such as NuoDB and VoltDB?

Dave Rosenthal: The most obvious response for the NoSQL data stores would be “we have ACID transactions, they don’t”, but the more important difference is in philosophy and strategy.

Each of those products expose a single data model and interface. Maybe two. We are pursuing a fundamentally different strategy.
We are building a storage substrate that can be adapted, via layers, to provide a variety of data models, APIs, and true flexibility.
We can do that because of our transactional capabilities. CouchDB, MongoDB, Cassandra and Riak all have different APIs and we talk to companies that run all of those products side-by-side. The NewSQL database players are also offering a single data model, albeit a very popular one, SQL. FoundationDB is offering an ever increasing number of data models through its “layers”, currently including several popular NoSQL data models and with SQL being the next big one to hit. Our philosophy is that you shouldn’t have to increase the complexity of your architecture by adopting a new NoSQL database each time your engineers need access to a new data model.

Q6. Cloud computing and open source: How does it relate to FoundationDB?

Dave Rosenthal: Cloud computing: FoundationDB has been designed from the beginning to run well in cloud environments that make use of large numbers of commodity machines connected through a network. Probably the most important aspect of a distributed database designed for cloud deployment is exceptional fault tolerance under very harsh and strange failure conditions – the kind of exceptionally unlikely things that can only happen when you have many machines working together with components failing unpredictably. We have put a huge amount of effort into testing FoundationDB in these grueling scenarios, and feel very confident in our ability to perform well in these types of environments. In particular, we have users running FoundationDB successfully on many different cloud providers, and we’ve seen the system keep its guarantees under real-world hardware and network failure conditions experienced by our users.

Open source: Although FoundationDB’s core data storage engine is closed source, our layer ecosystem is open source. Although the core data storage engine has a very simple feature set, and is very difficult to properly modify while maintaining correctness, layers are very feature rich and because they are stateless, are much easier to create and modify which makes them well suited to third-party contributions.

Q7. Please give some examples of use cases where FoundationDB is currently in use. Is FoundationDB in use for analyzing Big Data as well?

Dave Rosenthal: Some examples: User data, meta data, user social graphs, geo data, via ORMs using the SQL layer, metrics collection, etc.

We’ve mostly focused on operational systems, but a few of our customers have built what I would call “big data” applications, which I think of as analytics-focused. The most common use case has been for collecting and analyzing time-series data. FoundationDB is strongest in big data applications that call for lots of random reads and writes, not just big table scans—which many systems can do well.

Q8. Rick Cattell said in a recent interview [2] “there aren’t enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves”. What is your opinion on this?

Dave Rosenthal: People have great ideas for databases all the time. New data models, new query languages, etc.
If nothing else, this NoSQL experiment that we’ve all been a part of the past few years has shown us all the appetite for data models suited to specific problems. They would love to be able to build these tools, open source them, etc.
The problem is that the checklist of practical considerations for a database is huge: Fault tolerance, scalability, a backup solution, management and monitoring, ACID transactions, etc. Add those together and even the simplest concept sounds like a huge project.

Our vision at FoundationDB is that we have done the hard work to build a storage substrate that simultaneously solves all those tricky practical problems. Our engine can be used to quickly build a database layer for any particular application that inherits all of those solutions and their benefits, like scalability, fault tolerance and ACID compliance.

Q9. Nick Heudecker of Gartner, predicts that [3] “going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time” . What is your take on this?

Dave Rosenthal: I do think that the lines between SQL and NoSQL will start to blur and I believe that we are leading that charge. We acquired another database startup last year called Akiban that builds an amazing SQL database engine.
In 2014 we’ll be bringing that engine to market as a layer running on top of FoundationDB. That will be a true ANSI SQL database operating as a module directly on top of a transactional “NoSQL” engine, inheriting the operational benefits of our core storage engine – scalability, fault tolerance, ease of operation.

When you run multiple of the SQL layer modules, you can point many of them at the same key-space in FoundationDB and it’s as if they are all part of the same database, with ACID transactions enforced across the separate SQL layer processes.
It’s very cool. Of course, you can even run the SQL layer on a FoundationDB cluster that’s also supporting other data models, like graph or document. That’s about as blurry as it gets.

———–
Dave Rosenthal is CEO of FoundationDB. Dave started his career in games, building a 3D real-time strategy game with a team of high-school friends that won the 1st annual Independent Games Festival. Previously, Dave was CTO at Visual Sciences, a pioneering web-analytics company that is now part of Adobe. Dave has a degree in theoretical computer science from MIT.

Related Posts
– Operational Database Management Systems. Interview with Nick Heudecker. ODBMS Industry Watch, December 16, 2013

Follow ODBMS.org on Twitter: @odbmsorg

 

On multi-model databases. Interview with Martin Schönert and Frank Celler
http://www.odbms.org/blog/2013/10/on-multi-model-databases-interview-with-martin-schonert-and-frank-celler/ (28 October 2013)

“We want to prevent a deadlock where the team is forced to switch the technology in the middle of the project because it doesn’t meet the requirements any longer.”–Martin Schönert and Frank Celler.

On “multi-model” databases, I have interviewed Martin Schönert and Frank Celler, founders and creators of the open source ArangoDB.

RVZ

Q1. What is ArangoDB and for what kind of applications is it designed for?

Frank Celler: ArangoDB is a multi-model mostly-memory database with a flexible data model for documents and graphs. It is designed as a “general purpose database”, offering all the features you typically need for modern web applications.

ArangoDB is supposed to grow with the application—the project may start as a simple single-server prototype, nothing you couldn’t do with a relational database equally well. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end—this is where Foxx, ArangoDB’s integrated Javascript application framework, comes into play.
The overall idea is: “We want to prevent a deadlock where the team is forced to switch the technology in the middle of the project because it doesn’t meet the requirements any longer.”

ArangoDB is open source (Apache 2 licence)—you can get the source code at GitHub or download the precompiled binaries from our website.

Though ArangoDB takes a universal approach, there are edge cases where we don’t recommend it. Actually, ArangoDB doesn’t compete with massively distributed systems like Cassandra with thousands of nodes and many terabytes of data.

Q2. What’s so special about the ArangoDB data model?

Martin Schönert: ArangoDB is a multi-model database. It stores documents in collections. A specialized binary data file format is used for disk storage. Documents that have similar structure (i.e., that have the same attribute names and attribute types) can share their structural information. The structure (called “shape”) is saved just once, and multiple documents can re-use it by storing just a pointer to their “shape”.
In practice, documents in a collection are likely to be homogenous, and sharing the structure data between multiple documents can greatly reduce disk storage space and memory usage for documents.
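
To make the idea tangible with invented attribute names: the documents {"name": "anna", "age": 31} and {"name": "ben", "age": 45} have the same attribute names and types, so their shape is stored once and each document keeps only a pointer to that shape plus its own values; a third document such as {"name": "carol", "email": "c@example.org"} would introduce a second shape.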

Q3. Who is currently using ArangoDB for what?

Frank Celler: ArangoDB is open source. You don’t have to register to download the source code or precompiled binaries. As a user, you can get support via Google Group, GitHub’s issue tracker and even via Twitter. We are very approachable, which is an essential part of the project. The drawback is that we don’t really know what people are doing with ArangoDB in detail. We are noticing an exponentially increasing number of downloads over the last months.
We are aware of a broad range of use cases: a CMS, a high-performance logging component, a geo-coding tool, an annotation system for animations, just to name a few. Other interesting use cases are single page apps or mobile apps via Foxx, ArangoDB’s application framework. Many of our users have in-production experience with other NoSQL databases, especially the leading document stores.

Q4. Could you motivate your design decision to use Google’s V8 JavaScript engine?

Martin Schönert: ArangoDB uses Google’s V8 engine to execute server-side JavaScript functions. Users can write server-side business logic in JavaScript and deploy it in ArangoDB. These so-called “actions” are much like stored procedures living close to the data.
For example, with actions it is possible to perform cascading deletes/updates, assign permissions, and do additional calculations and modifications to the data.
ArangoDB also allows users to map URLs to custom actions, making it usable as an application server that handles client HTTP requests with user-defined business logic.
We opted for Javascript as it meets our requirements for an “embedded language” in the database context:
• Javascript is widely used. Regardless in which “back-end language” web developers write their code, almost everybody can code also in Javascript.
• Javascript is effective and still modern.
Just as well, we chose Google V8, as it is the fastest, most stable Javascript interpreter available for the time being.

Q5. How do you query ArangoDB if you don’t want to use JavaScript?

Frank Celler: ArangoDB offers a couple of options for getting data out of the database. It has a REST interface for CRUD operations and also allows “querying by example”. “Querying by example” means that you create a JSON document with the attributes you are looking for. The database returns all documents which look like the “example document”.
Expressing complex queries as JSON documents can become a tedious task—and it’s almost impossible to support joins following this approach. We wanted a convenient and easy-to-learn way to execute even complex queries, not involving any programming as in an approach based on map/reduce. As ArangoDB supports multiple data models including graphs, it was neither sufficient to stick to SQL nor to simply implement UNQL. We ended up with the “ArangoDB query language” (AQL), a declarative language similar to SQL and Jsoniq. AQL supports joins, graph queries, list iteration, results filtering, results projection, sorting, variables, grouping, aggregate functions, unions, and intersections.
Of course, ArangoDB also offers drivers for all major programming languages. The drivers wrap the mentioned query options following the paradigm of the programming language and/or frameworks like Ruby on Rails.
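
As a hedged sketch of the “querying by example” interface mentioned above, the call below goes through ArangoDB’s HTTP simple-query endpoint on the default port; the collection and attribute names are invented.

import requests

# Return all documents in the hypothetical "users" collection that look like
# the example document, i.e. whose name attribute equals "John".
resp = requests.put(
    "http://localhost:8529/_api/simple/by-example",
    json={"collection": "users", "example": {"name": "John"}},
)
resp.raise_for_status()
for doc in resp.json()["result"]:
    print(doc)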

Q6. How do you perform graph queries? How does this differ from systems such as Neo4J?

Frank Celler: SQL can’t cope with the required semantics to express the relationships between graph nodes, so graph databases have to provide other ways to access the data.
The first option is to write small programs, so-called “path traversals.” In ArangoDB, you use JavaScript; in Neo4j, Java. The general approach is very similar.
Programming gives you all the freedom to do whatever comes to your mind. That’s good. For standard use cases, programming might be too much effort. So, both ArangoDB and neo4j offer a declarative language—neo4j has “Cypher,” ArangoDB the “ArangoDB Query Language.” Both also implement the blueprints standard so that you can use “Gremlin” as query-language inside Java. We already mentioned that ArangoDB is a multi-model database: AQL covers documents and graphs, it provides support for joins, lists, variables, and does much more.

The following example is taken from the neo4j website:

“For example, here is a query which finds a user called John in an index and then traverses the graph looking for friends of John’s friends (though not his direct friends) before returning both John and any friends-of-friends that are found.

START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof”

The same query looks in AQL like this:

FOR t IN TRAVERSAL(users, friends, "users/john", "outbound",
    {minDepth: 2}) RETURN t.vertex._key

The result is:
[ "maria", "steve" ]

You see that Cypher describes patterns while AQL describes joins. Internally, ArangoDB has a library of graph functions—those functions return collections of paths and paths or use those collections in a join.
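
And a similarly hedged sketch of submitting the AQL traversal above through ArangoDB’s HTTP cursor API (default endpoint, no authentication assumed):

import requests

query = """
FOR t IN TRAVERSAL(users, friends, "users/john", "outbound", {minDepth: 2})
    RETURN t.vertex._key
"""

resp = requests.post("http://localhost:8529/_api/cursor", json={"query": query})
resp.raise_for_status()
print(resp.json()["result"])   # e.g. ["maria", "steve"]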

Q7. How did you design ArangoDB to scale out and/or scale up? Please give us some detail.

Martin Schönert: Solid state disks are becoming more and more commodity hardware. ArangoDB’s append-only design is a perfect fit for such SSDs, allowing for data sets which are much bigger than the main memory but still fit onto a solid state disk.
ArangoDB supports master/slave replication in version 1.4 which will be released in the next days (a beta has been available for some time). On the one hand this provides easy fail-over setups. On the other hand it provides a simple way to scale the read-performance.
Sharding is implemented in version 2.0. This enables you to store even bigger data-sets and increase the write-performance. As noted before, however, we see our main application when scaling to a low number of nodes. We don’t plan to optimize ArangoDB for massive scaling with hundreds of nodes. Plain key/value stores are much more usable in such scenarios.

Q8. What is ArangoDB’s update and delete strategy?

Martin Schönert: ArangoDB versions prior to 1.3 store all revisions of documents in an append-only fashion; the objects will never be overwritten. The latest version of a document is available to the end user.

With the current version 1.3, ArangoDB introduces transactions and sets the technical fundament for replication and sharding. In the course of those highly wanted features comes “real” MVCC with concurrent writes.

In databases implementing an append-only strategy, obsolete versions of a document have to be removed to save space. As we already mentioned, ArangoDB is multi-threaded: The so-called compaction is automatically done in the background in a different thread without blocking reads and writes.

Q9. How does ArangoDB differ from other NoSQL data stores such as Couchbase and MongoDB and graph data stores such as Neo4j, to name a few?

Frank Celler: ArangoDB’s feature scope is driven by the idea to give the developer everything he needs to master typical tasks in a web application—in a convenient and technically sophisticated way alike.
From our point of view it’s the combination of features and quality of the product which accounts for ArangoDB: ArangoDB not only handles documents but also graphs.
ArangoDB is extensible via Javascript and Ruby. Enclosed with ArangoDB you get “Foxx”. Foxx is an integrated application framework ideal for lean back-ends and single page Javascript applications (SPA).
Multi-collection transactions are useful not only for online banking and e-commerce but they become crucial in any web app in a distributed architecture. Here again, we offer the developers many choices. If transactions are needed, developers can use them.
If, on the other hand, the problem requires a higher performance and less transaction-safety, developers are free to ignore multi-collections transactions and to use the standard single-document transactions implemented by most NoSQL databases.
Another unique feature is ArangoDB’s query language AQL—it makes querying powerful and convenient. For simple queries, we offer a simple query-by-example interface. Then again, AQL enables you to describe complex filter conditions and joins in a readable format.

Q10. Could you summarize the main results of your benchmarking tests?

Frank Celler: To quote Jan Lenhardt from CouchDB: “Nosql is not about performance, scaling, dropping ACID or hating SQL—it is about choice. As nosql databases are somewhat different it does not help very much to compare the databases by their throughput and cho[o]se the one which is fast[est]. Instead—the user should carefully think about his overall requirements and weight the different aspects. Massively scalable key/value stores or memory-only system[s] can [achieve] much higher benchmarks. But your aim is [to] provide a much more convenient system for a broader range of use-cases—which is fast enough for almost all cases.”
Anyway, we have done a lot of performance tests and are more than happy with the results. ArangoDB 1.3 inserts up to 140,000 documents per second. We are going to publish the whole test suite including a test runner soon, so everybody can try it out on his own hardware.

We have also tested the space usage: Storing 3.5 million AQL search queries takes about 200 MB in MongoDB with pre-allocation compared to 55 MB in ArangoDB. This is the benefit of implementing the concept of shapes.

Q11. ArangoDB is open source. How do you motivate and involve the open source development community to contribute to your projects rather than any other open source NoSQL?

Frank Celler: To be honest: The contributors come of their own volition and until now we didn’t have to “push” interested parties. Obviously, ArangoDB is fascinating enough, even though there are more than 150 NoSQL databases available to choose from.

It all started when Ruby inventor Yukihiro “Matz” Matsumoto tweeted on ArangoDB and recommended it to the community. Following this tweet, ArangoDB’s first local fan base was established in Japan—and we learned a lot about the limits of automatic translation from Japanese tweets to English and the other way around ;-).

In our daily “work” with our community, we try to be as open and supportive as possible. The core developers communicate directly and within short response times with people having ideas or needing help through Google Groups or GitHub. We take care of a community, especially for contributors, where we discuss future features and inform about upcoming changes early so that API contributors can keep their implementations up to date.

——————————
Martin Schönert
Martin is the origin of many of the ideas in ArangoDB. As chief architect he is responsible for the overall architecture of the system, bringing in his experience from more than 20 years in IT as developer, architect, project manager and entrepreneur.
Martin started his career as a scientist at the technical university of Aachen after earning his degree in Mathematics. Later he worked as head of product development (Team4 Systemhaus), Director IT (OnVista Technologies) and head of division at Deutsche Post.
Martin has been working with relational and non-relational databases (including a torrid love-hate relationship with the granddaddy of all non-relational databases, Lotus Notes) for the largest part of his professional life.
When no database did what he needed, he wrote his own: one for extremely high update rates and another for distributed caching.

Frank Celler
Frank is both an entrepreneur and a backend developer, and has been developing mostly in-memory databases for two decades. He is the lead developer of ArangoDB and co-founder of triAGENS. Besides that, Frank organizes Cologne's NoSQL user group and NoSQL conferences, and speaks at developer conferences.
Frank studied in Aachen and London and received a PhD in Mathematics. Prior to founding triAGENS, the company behind ArangoDB, he worked for several German tech companies as consultant, team lead and developer.
His technical focus is C and C++; recently he gained some experience with Ruby when integrating mruby into ArangoDB.

Resources

– The stable version (1.3 branch) of ArangoDB can be downloaded here.
– ArangoDB on Twitter
– ArangoDB Google Group
– ArangoDB questions on StackOverflow
– Issue Tracker at Github

Related Posts

On geo-distributed data management — Interview with Adam Abrevaya. October 19, 2013

On Big Data and NoSQL. Interview with Renat Khasanshyn. October 7, 2013

On NoSQL. Interview with Rick Cattell. August 19, 2013

On Big Graph Data. August 6, 2012

Follow ODBMS.org on Twitter: @odbmsorg

##

On Big Data and NoSQL. Interview with Renat Khasanshyn. http://www.odbms.org/blog/2013/10/on-big-data-and-nosql-interview-with-renat-khasanshyn/ http://www.odbms.org/blog/2013/10/on-big-data-and-nosql-interview-with-renat-khasanshyn/#comments Mon, 07 Oct 2013 19:50:27 +0000 http://www.odbms.org/blog/?p=2636

“The most important thing is to focus on a task you need to solve instead of a technology” –Renat Khasanshyn.

I have interviewed Renat Khasanshyn, Founder and Chairman of Altoros.
Renat is a NoSQL and Big Data specialist.

RVZ

Q1. In your opinion, what are the most popular players in the NoSQL market?

Khasanshyn: I think MongoDB is definitely one of the key players in the NoSQL market. This database has a long history, at least for this kind of product, and good commercial support. For many people this database became the first mass-market NoSQL store. I can assume that MongoDB will become for NoSQL what MySQL became for relational databases. Second place I would give to Cassandra. It has a great architecture and enables building clusters with geographically dispersed nodes, which seems absolutely amazing to me. In addition, this database is often chosen by big companies that need a large, highly available cluster.

Q2. How do you evaluate and compare different NoSQL Databases?

Khasanshyn: Thank you for an interesting question. How to choose a database? Which one is the best? These are the main questions for any company that wants to try a NoSQL solution. For some cases it may be quite easy to select a suitable NoSQL store, but very often it is not enough just to know the customer's business goals. When we suggest a particular database we take into consideration the following factors: the business issues the NoSQL store should solve, reading/writing speed, availability, scalability, and many other important indicators. Sometimes we use a hybrid solution that may include several NoSQL databases.
Or we may see that a relational database is a good match for the case. The most important thing is to focus on the task you need to solve instead of on a technology.

We think that good scalability, performance, and ease of administration are the most important criteria for choosing a NoSQL database. These are the key factors we take into consideration. Of course, there are additional criteria that sometimes may be even more important than those mentioned above. To simplify the choice of a database for our engineers and for many other people, we have for the past two years been carrying out independent tests that evaluate the performance of different NoSQL databases. Although aimed at comparing performance, these investigations also touch on consistency, scalability, and configuration issues. You can take a look at our most recent article on this subject on Network World. New research on this subject is to be published in a couple of months.
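
One simple way to make that kind of weighting explicit is a scoring matrix. The criteria, weights and scores below are placeholders to show the method, not Altoros benchmark results:

    # Illustrative weighted-criteria comparison; all numbers are made up
    # to show the method, not to rank real products.
    weights = {"scalability": 0.4, "performance": 0.35, "administration": 0.25}

    candidates = {
        "store_a": {"scalability": 8, "performance": 7, "administration": 6},
        "store_b": {"scalability": 6, "performance": 9, "administration": 7},
    }

    def weighted_score(scores):
        return sum(weights[c] * scores[c] for c in weights)

    for name, scores in sorted(candidates.items(),
                               key=lambda kv: weighted_score(kv[1]),
                               reverse=True):
        print(f"{name}: {weighted_score(scores):.2f}")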

Q3. Which NoSQL databases have you evaluated so far, and what results did you obtain?

Khasanshyn: We have used a great variety of NoSQL databases, for instance Cassandra, HBase, MongoDB, Riak, Couchbase Server, Redis, etc., for our research and real-life projects. From this experience we learned that one should be very careful when choosing a database. It is better to spend more time on architecture design and make changes to the project at the beginning rather than run into a serious issue later.

Q4. For which projects did you use NoSQL databases, and for what kinds of problems?

Khasanshyn: It is hardly possible to name a project for which a NoSQL database would be useless, except for a blog or a home page. As the main use cases for NoSQL stores I would mention the following tasks:

● collecting and analyzing large volumes of data
● scaling large historical databases
● building interactive applications for which performance and fast response time to users’ actions are crucial

The major "drawback" of NoSQL architecture is the absence of an ACID engine that verifies transactions. It means that financial operations or user registration are better handled by an RDBMS like Oracle or MS SQL. However, the absence of ACID allows for significant acceleration and decentralization of NoSQL databases, which are their major advantages. The bottom line is that non-relational databases are much faster in general, and they pay for it with a fraction of the reliability. Is it a good tradeoff? That depends on the task.

Q5. What do you see happening with NoSQL, going forward?

Khasanshyn: It's quite difficult to make predictions, but we expect NoSQL and relational databases to grow closer. For instance, NewSQL solutions took good scalability from NoSQL and a query language from the SQL products.
Probably a standard query language based on SQL, or an SQL-like language, will soon appear for NoSQL stores. We are also looking forward to improved consistency, or to be more precise, better predictability of NoSQL databases. In other words, NoSQL solutions should become more mature. We will also see some market consolidation: smaller players will form alliances or quit the game, leaders will take a bigger share of the market, and we will most likely see a couple of acquisitions. Overall, it will be easier to work with NoSQL and to choose the right solution out of the available options.

Q6. What do you think is still needed for big data analytics to be really useful for the enterprise?

Khasanshyn: It is all about people and their skills. Storage is relatively cheap and available. The variety of databases is enormous, and it helps in solving virtually any task. Hadoop is stable. The majority of the software is open source, or at least doesn't cost a fortune. But all these components are useless without data scientists who can do modeling and simulations on the existing data, without engineers who can efficiently employ the toolset, and without managers who understand the outcomes of the data-handling revolution that happened just recently. When we have these three types of people in place, then we can say that enterprises are fully equipped for gaining an edge in big data analytics.

Q7. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey. When you speak with your customers what are the typical use cases and requirements they have?

Khasanshyn: I agree with you. Some customers are just making their first steps with Hadoop, while others need to know how to optimize their Hadoop-based systems. Unfortunately, the second group of customers is much smaller. I can name the following typical tasks our customers have:

● To process historical data that has been collected over a long period of time. Some time ago, users were unable to process large volumes of unstructured data due to financial and technical limitations. Now Hadoop can do it at a moderate cost and in a reasonable time.

● To develop a system for data analysis based on Hadoop. Once an hour, the system builds patterns of typical user behavior on a Web site. These patterns help the site react to users' actions in real time, for instance allowing an action or temporarily blocking actions that are not typical of this user. The data is collected continuously and analyzed at the same time, so the system can rapidly respond to changes in user behavior (a minimal sketch of such an hourly job follows this list).

● To optimize data storage. It is interesting that in some cases HDFS can replace a database, especially when the database was used for storing raw input data. Such projects do not need an additional database level.
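
As a minimal sketch of the hourly pattern-building job mentioned in the second item, here is what a Hadoop Streaming mapper and reducer might look like in Python. The clickstream log format (user id and action separated by tabs) is an assumption for illustration:

    # hourly_patterns.py : a Hadoop Streaming job, mapper and reducer in one file.
    import sys
    from collections import Counter

    def mapper():
        # Assumed input format per line: user_id <TAB> action <TAB> ...
        for line in sys.stdin:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                print(f"{parts[0]}\t{parts[1]}")

    def reducer():
        # Hadoop Streaming delivers mapper output sorted by key,
        # so all lines for one user arrive together.
        current_user, counts = None, Counter()
        for line in sys.stdin:
            user_id, action = line.rstrip("\n").split("\t")
            if current_user is not None and user_id != current_user:
                print(f"{current_user}\t{dict(counts)}")
                counts = Counter()
            current_user = user_id
            counts[action] += 1
        if current_user is not None:
            print(f"{current_user}\t{dict(counts)}")

    if __name__ == "__main__":
        reducer() if "--reduce" in sys.argv else mapper()

Scheduled once an hour with the Hadoop Streaming jar (mapper and reducer flags pointing at this script), the reducer output becomes the per-user behavior profile that the real-time layer consults.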

I should say that our customers have similar requirements. Apart from solving a particular business task, they need a certain level of performance and data consistency.

Q8. In your opinion is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?

Khasanshyn: In a few words, my answer is yes. Some specific features of Hadoop make it possible to prepare data for future analysis quickly and at a moderate cost. In addition, this technology can work with unstructured data. However, I do not think it will happen very soon. There are many OLAP systems out there solving their tasks, some better and some worse, and in general enterprises are very reluctant to change anything. In addition, replacing the OLAP tools requires additional investment. The good news is that we don't have to choose one or the other: Hadoop can be used as a pre-processor of the data for OLAP analysis, and analysts can keep working with their familiar tools.

Q9. How do you categorize the various stages of the Hadoop usage in the enterprises?

Khasanshyn: I would name the following stages of Hadoop usage:

1. Development of prototypes to check out whether Hadoop is suitable for their tasks
2. Using Hadoop in combination with other tools for storing and processing data of some business units
3. Implementation of a centralized enterprise data storage system and gradual integration of all business units into it

Q10. Data Warehouse vs Big “Data Lake”. What are the similarities and what are the differences?

Khasanshyn: Even though Big "Data Lake" is a metaphor, I do not really like it.
The name suggests something limited, isolated from other systems. I would rather call this concept a "Big Data Ocean," because the data added to the system can interact with the surrounding systems. In my opinion, data warehouses are a bit outdated. At an earlier stage, such systems enabled companies to aggregate a lot of data in one central storage and to arrange this data, all at an acceptable speed. Now there are a lot of cheap storage solutions, so we can keep enormous volumes of data and process them much faster than with data warehouses.

The ability to store large volumes of data is a common feature of data warehouses and a Big "Data Lake." With a flexible architecture and broad capabilities for data analysis and discovery, a Big "Data Lake" provides a wider range of business opportunities. A modern company should adjust to changes very fast; the structure that was good yesterday may become a limitation today.

Q11. In your opinion, is there a technology which is best suited to build a Big Data Analytics Data Platform? If yes, which one?

Khasanshyn: As I have already said, there is no "magic bullet" that can cure every disease. There is no universal Big Data analytics platform that fits every case; everything depends on the requirements of a particular business. A system of this kind should have the following features (a small sketch of how the pieces fit together follows the list):

● A Hadoop-based system for storing and processing data
● An operational database that contains the most recent data (possibly raw data) to be analyzed. A NoSQL solution can be used for this purpose.
● A database for storing predicted indicators. Most probably, it should be a relational data store.
● A system that allows for creating data analysis algorithms. The R language can be used for this task.
● A report-building system that provides access to the data. For instance, there are good options such as Tableau or Pentaho.
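
A toy end-to-end sketch of how those pieces might hand data to one another; the in-memory list stands in for the operational NoSQL store, SQLite stands in for the relational store of predicted indicators, and a naive moving average stands in for the R modeling step (all of these stand-ins are my own illustration, not a recommended stack):

    # Toy pipeline: operational store -> analysis step -> relational store of indicators.
    import sqlite3

    # Stand-in for recent raw events pulled from an operational NoSQL store.
    recent_events = [
        {"metric": "orders", "value": 120},
        {"metric": "orders", "value": 135},
        {"metric": "orders", "value": 128},
    ]

    # Stand-in for the modeling step (in practice this could be an R job).
    def predict_next(values):
        return sum(values) / len(values)        # naive moving average

    prediction = predict_next([e["value"] for e in recent_events])

    # Stand-in for the relational store of predicted indicators.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE indicators (metric TEXT, predicted REAL)")
    conn.execute("INSERT INTO indicators VALUES (?, ?)", ("orders", prediction))
    conn.commit()
    print(conn.execute("SELECT * FROM indicators").fetchall())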

Q12. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

Khasanshyn: In my opinion, cloud computing became the force that raised the Big Data wave. Elastic computing enables us to use exactly the amount of computation resources we need and also reduces the cost of maintaining a large infrastructure.

There is a strong connection between elastic computing and big data analytics. For us it is quite a typical case that we have to process data only from time to time, for instance once a day. To solve this task, we can deploy a new cluster or scale up an existing Hadoop cluster in the cloud environment. We can temporarily increase the speed of data processing by scaling the Hadoop cluster; the task will be executed faster, and after that we can stop the cluster or reduce its size. I would even say that cloud technology is a must-have component of a Big Data analysis system.

——-

Renat Khasanshyn, Founder and Chairman, Altoros.
Renat is founder & CEO of Altoros, and Venture Partner at Runa Capital. Renat helps define Altoros's strategic vision and its role in the Big Data, Cloud Computing, and PaaS ecosystems. Renat is a frequent conference and meetup speaker on these topics.
Under his supervision Altoros has been servicing such innovative companies as Couchbase, RightScale, Canonical, DataStax, Joyent, Nephoscale, and NuoDB.
In the past, Renat was selected as a finalist for the Emerging Executive of the Year award by the Massachusetts Technology Leadership Council and once won an IBM Business Mashup Challenge. Prior to founding Altoros, Renat was VP of Engineering for the Tampa-based insurance company PriMed. Renat is also founder of Apatar, an open source data integration toolset, founder of the Silicon Valley NewSQL User Group and co-founder of the Belarusian Java User Group.

——————–
Related Posts

On NoSQL. Interview with Rick Cattell. August 19, 2013

On Oracle NoSQL Database –Interview with Dave Segleau. July 2, 2013

On Big Data and Hadoop. Interview with Paul C. Zikopoulos. June 10, 2013
————————-

Resources

Evaluating NoSQL Performance, Sergey Bushik, Altoros (Slideshare)

“A Vendor-independent Comparison of NoSQL Databases: Cassandra, HBase, MongoDB, Riak”. Altoros (registration required)

ODBMS.org free resources on Big Data and Analytical Data Platforms:
Blog Posts | Free Software| Articles | Lecture Notes | PhD and Master Thesis|

ODBMS.org free resources on NoSQL Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis

ODBMS.org free resources on Cloud Data Stores:
Blog Posts | Lecture Notes| Articles and Presentations| PhD and Master Thesis|

———————–

Follow ODBMS.org on Twitter: @odbmsorg

##

On NoSQL. Interview with Rick Cattell. http://www.odbms.org/blog/2013/08/on-nosql-interview-with-rick-cattell/ http://www.odbms.org/blog/2013/08/on-nosql-interview-with-rick-cattell/#comments Mon, 19 Aug 2013 07:48:00 +0000 http://www.odbms.org/blog/?p=2576

” There aren’t enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves. Likewise, companies with closed source will have trouble finding customers willing to risk a closed source (or limited open source) solution. It will be interesting to see what happens. But I don’t see NoSQL going away, there is a well-established following.” –Rick Cattell.

I have asked Rick Cattell, one of the leading independent consultants in database systems, a few questions on NoSQL.

RVZ

Q1. For years, you have been studying the NoSQL area and writing articles about scalable databases. What is new in the last year, in your view? What is changing?

Rick Cattell: It seems like there’s a new NoSQL player every month or two, now!
It’s hard to keep track of them all. However, a few players have become much more popular than the others.

Q2. Which players are those?

Rick Cattell: Among the open source players, I hear the most about MongoDB, Cassandra, and Riak now, and often HBase and Redis. However, don’t forget that the proprietary players like Amazon, Oracle, and Google have NoSQL systems as well.

Q3. How do you define “NoSQL”?

Rick Cattell: I use the term to mean systems that provide simple operations like key/value storage or simple records and indexes, and that focus on horizontal scalability for those simple operations. Some people categorize horizontally scaled graph databases and object databases as "NoSQL" as well. However, those systems have very different characteristics: graph databases and object databases have to break connections up efficiently over distributed servers, and have to provide operations that somehow span servers as you traverse the graph. Distributed graph/object databases have been around for a while, but efficient distribution is a hard problem. The NoSQL databases simply distribute (or shard) each data type based on a primary key; that's easier to do efficiently.
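
A minimal sketch of that primary-key sharding idea, assuming a fixed cluster of nodes (real systems layer virtual nodes, replication and rebalancing on top of this):

    # Minimal hash-based sharding by primary key across a fixed set of nodes.
    import hashlib

    NODES = ["node-a", "node-b", "node-c", "node-d"]

    def node_for(key: str) -> str:
        # A stable hash of the primary key decides which node owns the record.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    def put(store, key, value):
        store.setdefault(node_for(key), {})[key] = value

    def get(store, key):
        return store.get(node_for(key), {}).get(key)

    cluster = {}
    put(cluster, "user:1234", {"name": "Ada"})
    print(node_for("user:1234"), get(cluster, "user:1234"))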

Q4. What other categories of systems do you see?

Rick Cattell: Well, there are systems that focus on horizontal scaling for full SQL with joins, which are generally called “NewSQL“, and systems optimized for “Big Data” analytics, typically based on Hadoop map/reduce. And of course, you can also sub-categorize the NoSQL systems based on their data model and distribution model.

Q5. What subcategories would those be?

Rick Cattell: On data model, I separate them into document databases like MongoDB and Couchbase, simple key/value stores like Riak and Redis, and grouped-column stores like HBase and Cassandra. However, a categorization by data model is deceptive, because they also differ quite a bit in their performance and concurrency guarantees.

Q6: Which systems perform best?

Rick Cattell: That’s hard to answer. Performance is not a scale from “good” to “bad”… the different systems have better performance for different kinds of applications. MongoDB performs incredibly well if all your data fits in distributed memory, for example, and Cassandra does a pretty good job of using disk, because of its scheme of writing new data to the end of disk files and consolidating later.

Q7: What about their concurrency guarantees?

Rick Cattell: They are all over the place on concurrency. The simplest provide no guarantees, only "eventual consistency": you don't know which version of the data you'll get with Cassandra. MongoDB can keep a "primary" replica consistent if you can live with its rudimentary locking mechanism.
Some of the newer systems try to provide full ACID transactions. FoundationDB and Oracle NoSQL claim to do that, but I haven't yet verified it. I have studied Google's Spanner paper, and they do provide true ACID consistency in a distributed world, for most practical purposes. Many people think the CAP theorem makes that impossible, but I believe their interpretation of the theorem is too narrow: most real applications can have their cake and eat it too, given the right distribution model. By the way, graph/object databases also provide ACID consistency, as does VoltDB, but as I mentioned, I consider them a different category.
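
One common way these systems expose that spectrum is through per-operation read and write quorums: with N replicas, choosing R and W such that R + W > N guarantees that a read overlaps at least one replica that saw the latest write. A small sketch of the rule itself, not any particular product's API:

    # Quorum rule sketch: with N replicas, R + W > N gives reads that overlap the latest write.
    def is_strongly_consistent(n_replicas: int, read_quorum: int, write_quorum: int) -> bool:
        return read_quorum + write_quorum > n_replicas

    configs = [
        ("eventual (N=3, R=1, W=1)", 3, 1, 1),
        ("quorum (N=3, R=2, W=2)", 3, 2, 2),
        ("read-one, write-all (N=3, R=1, W=3)", 3, 1, 3),
    ]
    for label, n, r, w in configs:
        verdict = "consistent reads" if is_strongly_consistent(n, r, w) else "eventual"
        print(label, "->", verdict)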

Q8: I notice you have an unpublished paper on your website, called 2x2x2 Requirements for Scalability. Can you explain what the 2x2x2 means?

Rick Cattell: Well, the first 2x means that there are two different kinds of scalability: horizontal scaling over multiple servers, and vertical scaling for performance on a single server. The remaining 2x2 means that there are two key features needed to achieve the horizontal and vertical scaling, and for each of those, there are two additional things you have to do to make the features practical.

Q9: What are those key features?

Rick Cattell: For horizontal scaling, you need to partition and replicate your data. But you also need automatic failure recovery and database evolution with no downtime, because when your database runs on 200 nodes, you can’t afford to take the database offline and you can’t afford operator intervention on every failure.
To achieve vertical scaling, you need to take advantage of RAM and you need to avoid random disk I/O. You also need to minimize the overhead for locking and latching, and you need to minimize network calls between servers. There are various ways to do that. The best systems have all eight of these key features. These eight features represent my summary of scalable databases in a nutshell.

Q10: What do you see happening with NoSQL, going forward?

Rick Cattell: Good question. I see a lot of consolidation happening… there are too many players! There aren’t enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves. Likewise, companies with closed source will have trouble finding customers willing to risk a closed source (or limited open source) solution.
It will be interesting to see what happens. But I don’t see NoSQL going away, there is a well-established following.

————–
R. G. G. “Rick” Cattell is an independent consultant in database systems.
He previously worked as a Distinguished Engineer at Sun Microsystems, most recently on open source database systems and distributed database scaling. Dr. Cattell served for 20+ years at Sun Microsystems in management and senior technical roles, and for 10 years in research at Xerox PARC and at Carnegie-Mellon University. Dr. Cattell is best known for his contributions in database systems and middleware, including database scalability, enterprise Java, object/relational mapping, object-oriented databases, and database interfaces. He is the author of several dozen papers and five books, and a co-inventor of six U.S. patents.
At Sun he instigated the Enterprise Java, Java DB, and Java Blend projects, and was a contributor to a number of Java APIs and products. He previously developed the Cedar DBMS at Xerox PARC, the Sun Simplify database GUI, and SunSoft’s CORBA-database integration.
He is a co-founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), the co-creator of JDBC, the author of the world’s first monograph on object/relational and object databases, a recipient of the ACM Outstanding PhD Dissertation Award, and an ACM Fellow.

Related Posts

On Oracle NoSQL Database –Interview with Dave Segleau. July 2, 2013

On Real Time NoSQL. Interview with Brian Bulkowski. May 21, 2013

Resources

Rick Cattell home page.

ODBMS.org Free Downloads and Links
In this section you can download free resources covering the following topics:
Big Data and Analytical Data Platforms
Cloud Data Stores
Object Databases
NoSQL Data Stores
Graphs and Data Stores
Object-Oriented Programming
Entity Framework (EF) Resources
ORM Technology
Object-Relational Impedance Mismatch
NewSQL, XML, RDF Data Stores, RDBMS

Follow ODBMS.org on Twitter: @odbmsorg

##

On Oracle NoSQL Database –Interview with Dave Segleau. http://www.odbms.org/blog/2013/07/on-oracle-nosql-database-interview-with-dave-segleau/ http://www.odbms.org/blog/2013/07/on-oracle-nosql-database-interview-with-dave-segleau/#comments Tue, 02 Jul 2013 07:18:08 +0000 http://www.odbms.org/blog/?p=2454

“We went down the path of building Oracle NoSQL database because of explicit requests from some of our largest Oracle Berkeley DB installations that wanted to move away from maintaining home-grown sharding implementations and very much wanted a technology that could replicate, out of the box, the robustness of what they had built” –Dave Segleau.

On October 3, 2011 Oracle announced the Oracle NoSQL Database, and on December 17, 2012, Oracle shipped Oracle NoSQL Database R2. I wanted to know more about the status of the Oracle NoSQL Database. I have interviewed Dave Segleau, Director of Product Management,Oracle NoSQL Database.

RVZ

Q1. Who is currently using Oracle NoSQL Database, and for what kind of domain specific applications? Please give us some examples.

Dave Segleau: There is a range of users, from segments such as Web-scale Transaction Processing to Web-scale Personalization and Real-time Event Processing. To pick the area where I would say we see the largest adoption, it would be the Real-time Event Processing category. This is basically the use case that covers things like Fraud Detection, Telecom Services Billing, Online Gaming and Mobile Device Management.

Q2. What is new in Oracle NoSQL Database R2?

Dave Segleau: We added significant enhancements to NoSQL Database in the areas of Configuration Management/Monitoring (CM/M), APIs and Application Developer Usability, as well as Integration with the Oracle technology stack.
In the area of CM/M, we added “Smart Topology” ( an automated capacity and reliability-aware data storage allocation with intelligent request routing), configuration elasticity and rebalancing, JMX/SNMP support. In the area of APIs and Application Developer Usability we added a C API, support for values as JSON objects (with AVRO serialization), JSON schema definitions, and a Large Object API (including a highly efficient streaming interface). In the area of Integration we added support for accessing NoSQL Database data via Oracle External Tables (using SQL in the Oracle Database), RDF Graph support in NoSQL Database, Oracle Coherence as well as integration with Oracle Event Processing.

Q3. How would you compare Oracle NoSQL with respect to other NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak?

Dave Segleau: The Oracle NoSQL Database is a key-value store, although it also supports JSON as a value type similar to a document store. Architecturally it is closer to Riak, Cassandra and the Amazon Dynamo-based implementations, rather than the other technologies, at least at the highest level of abstraction. With regards to features, Oracle NoSQL Database shares a lot of commonality with Riak. Our performance and scalability characteristics are showing up with the best results in YCSB benchmarks.

Q4. What is the implication of having Oracle Berkeley DB Java Edition as the core engine for the Oracle NoSQL database?

Dave Segleau: It means that Oracle NoSQL Database provides a mission-critical, proven database technology at the heart of the implementation. Many of the other NoSQL databases use relatively new implementations for data storage and replication. Databases in general, and especially distributed parallel databases, are hard to implement well and to bring to high product quality and reliability. So we see the use of Oracle Berkeley DB, a pervasively deployed database engine behind thousands of mission-critical applications, as a big differentiator. Plus, many of the early NoSQL technologies are based on Oracle Berkeley DB, for example LinkedIn's Voldemort, Amazon's Dynamo and other popular commercial and enterprise social media products like Yammer.
The bottom line is that we went down the path of building Oracle NoSQL Database because of explicit requests from some of our largest Oracle Berkeley DB installations: they wanted to move away from maintaining home-grown sharding implementations and very much wanted a technology that could replicate, out of the box, the robustness of what they had built.

Q5. What is the relationships between the underlying “cleaning” mechanism to free up unused space in Oracle Berkeley DB, and the predictability and throughput in Oracle NoSQL Database?

Dave Segleau: As mentioned in the previous section, Oracle NoSQL Database uses Oracle Berkeley DB Java Edition as the key-value storage mechanism within the Storage Nodes. Oracle Berkeley DB Java Edition uses a no-overwrite log file system to store the data and a configurable, multi-threaded background log cleaner task to compact and clean log files and free up unused disk space. The Oracle Berkeley DB log cleaner has undergone many years of in-house and real-world high-volume validation and tuning. Oracle NoSQL Database pre-defines the BDB cleaner parameters for optimal configuration for this particular use case. The cleaner enhances system throughput and predictability by a) running as a low-level background task, and b) being preconfigured to minimize impact on the running system. The combination of these two characteristics leads to more predictable system throughput.

Several other NoSQL database products have implemented heavyweight tasks to compact, compress and free up disk space, and running them definitely impacts system throughput and predictability. From our point of view, not only do you want a NoSQL database that has excellent performance, you also need predictable performance. Routine tasks like Java GCs and disk space management should not cause major impacts to operational throughput in a production system.
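
A toy illustration of the no-overwrite log plus background cleaner pattern described above (a simplification, not Berkeley DB's actual implementation): records are only ever appended, and the cleaner copies live records forward before dropping old segments.

    # Toy append-only log with a compacting "cleaner": live records are copied
    # forward, dead (overwritten) entries are dropped with the old segments.
    class AppendOnlyLog:
        def __init__(self, segment_size=4):
            self.segment_size = segment_size
            self.segments = [[]]              # list of segments of (key, value) pairs
            self.index = {}                   # key -> (segment_no, offset)

        def put(self, key, value):
            seg_no = len(self.segments) - 1
            if len(self.segments[seg_no]) >= self.segment_size:
                self.segments.append([])
                seg_no += 1
            self.segments[seg_no].append((key, value))
            self.index[key] = (seg_no, len(self.segments[seg_no]) - 1)

        def get(self, key):
            seg_no, offset = self.index[key]
            return self.segments[seg_no][offset][1]

        def clean(self):
            # Rewrite only live records into a fresh log. A real cleaner runs
            # in the background; here it runs synchronously for clarity.
            live = [(k, self.get(k)) for k in self.index]
            self.segments, self.index = [[]], {}
            for k, v in live:
                self.put(k, v)

    log = AppendOnlyLog()
    for i in range(10):
        log.put("counter", i)                 # nine of these become dead entries
    log.clean()
    print(log.get("counter"), sum(len(s) for s in log.segments))  # -> 9 1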

Q7. Oracle NoSQL data model is using the concepts of “major” and “minor” key path. Why?

Dave Segleau: We heard from customers that they wanted both even distribution of data and co-location of certain sets of records. The Major/Minor key paradigm provides the best of both worlds. The Major key is the basis for the hash function, which causes Major key values to be evenly distributed throughout the key-value data store. The Minor key allows us to cluster multiple records for a given Major key together in the same storage location. In addition to being very flexible, it also provides additional benefits:
a) A scalable two-tier indexing structure. A hash map of Major Keys to partitions that contain the data, and then a B-tree within each partition to quickly locate Minor key values.
b) Minor keys allow us to perform efficient lookups and range scans within a Major key. For example, for userID 1234 (Major key), fetch all of the products that they browsed from January 1st to January 15th (Minor key).
c) Because all of the Minor key records for a given Major key are co-located on the same storage location, this becomes our basic unit of ACID transactions, allowing applications to have a transaction that spans a single record, multiple records or even multiple write operations on multiple records for a given major key.

This degree of flexibility is often lacking in other NoSQL database offerings.
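
A small sketch of the two-tier lookup described above: a hash of the Major key picks the partition, and a sorted structure inside the partition supports range scans over Minor keys (the standard-library bisect module stands in for the per-partition B-tree):

    # Two-tier index sketch: hash(major key) -> partition, sorted minor keys inside.
    import bisect
    import hashlib

    N_PARTITIONS = 8
    partitions = [{} for _ in range(N_PARTITIONS)]   # each: major -> (sorted minors, values)

    def _partition(major):
        return int(hashlib.md5(major.encode()).hexdigest(), 16) % N_PARTITIONS

    def put(major, minor, value):
        minors, values = partitions[_partition(major)].setdefault(major, ([], {}))
        if minor not in values:
            bisect.insort(minors, minor)
        values[minor] = value

    def range_scan(major, lo, hi):
        minors, values = partitions[_partition(major)].get(major, ([], {}))
        start, end = bisect.bisect_left(minors, lo), bisect.bisect_right(minors, hi)
        return [(m, values[m]) for m in minors[start:end]]

    # e.g. products browsed by user 1234 between January 1st and January 15th
    put("user:1234", "2013-01-03", "product-17")
    put("user:1234", "2013-01-09", "product-42")
    put("user:1234", "2013-02-02", "product-07")
    print(range_scan("user:1234", "2013-01-01", "2013-01-15"))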

Q8. Oracle NoSQL database is a distributed, replicated key-value store using a shared-nothing master-slave architecture. Why did you choose to have a master node architecture? How does it compare with other systems which have no master?

Dave Segleau: First of all, let's clarify that each shard has a master (so the system as a whole is multi-master) and that masters are elected. The Oracle NoSQL Database topology is deployed with a user-specified replication factor (how many copies of the data the system should maintain) and then, using a PAXOS-based mechanism, a master is elected. It is quite possible that a new master is elected under certain operating conditions. Plus, if you throw more hardware resources at the system, those "masters" will shift the data for which they are responsible, again to achieve the optimal latency profile. We are leveraging the enterprise-grade replication technology that is widely deployed via the Oracle Berkeley DB Java Edition. Also, by using an elected-master implementation, we can provide a fully ACID transaction on an operation-by-operation basis.

Q9. It is known that when the master node for a particular key-value fails (or because of a network failure), some writes may get lost. What is the implication from an application point of view?

Dave Segleau: This is a complex question, in that the answer depends largely on the type of durability requested for the operation, and that is controlled by the developer. In general, though, committed transactions acknowledged by a simple majority of nodes (our default durability) are not lost when a master fails. In the case of less aggressive durability policies, in-flight transactions that have been subject to network, disk, or server failures are handled similarly to process failures in other database implementations: the transactions are rolled back. However, a new master will quickly be elected and future requests will go through without a hitch. Applications can guard against such situations by handling exceptions and performing a retry.
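
The retry pattern described here might look roughly like the sketch below on the client side. The store object and its exception type are hypothetical placeholders, not the actual Oracle NoSQL Database API:

    # Hedged sketch of client-side retry after a master failover.
    # `store` and `TransientStoreError` are hypothetical stand-ins for a real driver.
    import time

    class TransientStoreError(Exception):
        """Raised by the hypothetical client while a new master is being elected."""

    def put_with_retry(store, key, value, attempts=5, backoff_s=0.2):
        for attempt in range(1, attempts + 1):
            try:
                return store.put(key, value)       # hypothetical driver call
            except TransientStoreError:
                if attempt == attempts:
                    raise                           # give up after the last attempt
                time.sleep(backoff_s * attempt)     # back off while the election completes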

Q10. Justin Sheehy of Basho in an interview said (1): “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense” Would you recommend to your clients to use Oracle NoSQL database for banking applications?

Dave Segleau: Absolutely. The Oracle NoSQL Database offers a range of transaction durability and consistency options on a per operation basis. The choice of eventual consistency is best made on a case by case basis, because while using it can open up new levels of scalability and performance, it does come with some risk and/or alternate processes which have a cost. Some NoSQL vendors don’t provide the options to leverage ACID transactions where they make sense, but the Oracle NoSQL Database does.

Q11. Could you detail how Elasticity is provided in R2?

Dave Segleau: The Oracle NoSQL Database slices data up into partitions within highly available replication groups. Each replication group contains an elected master and a number of replicas, based on user configuration. The exact configuration will vary depending on the read latency/write throughput requirements of the application. The processes associated with those replication groups run on hardware (Storage Nodes) declared to the Oracle NoSQL Database. For elasticity purposes, additional Storage Nodes can be declaratively added to a running system, in which case some of the data partitions will be re-allocated onto the new hardware, thereby increasing the number of shards and the write throughput. Additionally, the number of replicas can be increased to improve read latency and increase reliability. The process of rebalancing data partitions, spawning new replicas, and forming new Replication Groups causes those internal data partitions to automatically move around the Storage Nodes to take advantage of the new storage capacity.

Q12. What is the implication from a developer perspective of having a Avro Schema Support?

Dave Segleau: For the developer, it means better support for seamless JSON storage. There are other downstream implications, like compatibility and integration with Hadoop processing, where AVRO is quickly becoming a standard not only as an efficient wire-line serialization protocol but also for HDFS storage. Also, AVRO is a very efficient serialization format for JSON, unlike other serialization options such as BSON, which tend to be much less efficient. In the future, Oracle NoSQL Database will leverage this flexible AVRO schema definition in order to provide features like table abstractions, projections and secondary index support.
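
For a feel of what an AVRO-described JSON value looks like on the wire, here is a small sketch assuming the fastavro Python library; the schema and field names are invented for illustration and are not part of the product:

    # Sketch of AVRO serialization of a JSON-like record, assuming the fastavro library.
    import io
    from fastavro import parse_schema, schemaless_writer, schemaless_reader

    schema = parse_schema({
        "namespace": "example",
        "type": "record",
        "name": "UserEvent",
        "fields": [
            {"name": "user_id", "type": "long"},
            {"name": "action",  "type": "string"},
            {"name": "ts",      "type": "long"},
        ],
    })

    buf = io.BytesIO()
    schemaless_writer(buf, schema, {"user_id": 1234, "action": "browse", "ts": 1357027200})
    print(f"{buf.tell()} bytes")                   # compact binary; field names are not repeated

    buf.seek(0)
    print(schemaless_reader(buf, schema))          # back to a Python dict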

Q13. How do you handle Large Object Support?

Dave Segleau: Oracle NoSQL Database provides a streaming interface for Large Objects. Internally, we break a Large Object up into chunks and use parallel operations to read and write those chunks to/from the database. We do it in an ordered fashion so that you can begin consuming the data stream before all of the contents are returned to the application. This is useful when implementing functionality like scrolling partial results, streaming video, etc. Large Object operations are restartable and recoverable. Let’s say that you start to write a 1 GB Large Object and sometime during the write operation a failure occurs and the write is only partially completed. The application will get an exception. When the application re-issues the Large Object operation, NoSQL resumes where it left off, skipping chunks that were already successfully written.
The Large Object chunking implementation also ensures that partially written Large Objects are not readable until they are completely written.
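
The chunk-and-resume behaviour can be pictured with a small sketch (a simplification of the idea, not the product's wire protocol): the writer records which chunk indexes have been stored, and a restarted upload skips them.

    # Sketch of a restartable chunked upload: already-written chunks are skipped on retry.
    CHUNK_SIZE = 4                                 # tiny on purpose, for the demo

    def chunks(data, size=CHUNK_SIZE):
        return [data[i:i + size] for i in range(0, len(data), size)]

    def upload(data, stored, fail_after=None):
        """stored: dict chunk_index -> bytes, persisted across attempts."""
        for i, chunk in enumerate(chunks(data)):
            if i in stored:
                continue                           # resume: skip chunks already written
            if fail_after is not None and i >= fail_after:
                raise IOError("simulated failure mid-upload")
            stored[i] = chunk
        stored["complete"] = True                  # object readable only once fully written

    blob, stored = b"0123456789ABCDEF", {}
    try:
        upload(blob, stored, fail_after=2)         # first attempt fails part-way
    except IOError:
        pass
    upload(blob, stored)                           # retry resumes where it left off
    print(stored["complete"], b"".join(stored[i] for i in range(4)) == blob)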

Q14. A NoSQL Database can act as an Oracle Database External Table. What does it mean in practice?

Dave Segleau: What we have achieved here is the ability to treat the Oracle NoSQL Database as a resource that can participate in SQL queries originating from an Oracle Database via standard SQL query facilities. Of course, the developer has to define a template that maps the “value” into a table representation. In release 2 we provide sample templates and configuration files that the application developer can use in order to define the required components. In the future, Oracle NoSQL Database will automate template definitions for JSON values. External Table capabilities give seamless access to both structured relational and unstructured NoSQL data using familiar SQL tools.

Q15. Why Atomic Batching is important?

Dave Segleau: If by Atomic Batching you mean the ability to perform more than one data manipulation in a single transaction, then atomic batching is the only real way to ensure logical consistency in multi-data update transactions. The Oracle NoSQL Database provides this capability for data beneath a given major key.

Q16 What are the suggested criteria for users when they need to choose between durability for lower latency, higher throughput and write availability?

Dave Segleau: That’s a tough one to answer, since it is so case by case dependent. As discussed above in the banking question, in general if you can achieve your application latency goals while specifying high durability, then that’s your best course of action. However, if you have more aggressive low-latency/high-throughput requirements, you may have to assess the impact of relaxing your durability constraints and of the rare case where a write operation may fail. It’s useful to keep in mind that a write failure is a rare event because of the inherent reliability built into the technology.

Q17. Tomas Ulin mentioned in an interview (2) that “with MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements”. Isn’t MySQL 5.6 in fact competing with Oracle NoSQL database?

Dave Segleau: MySQL is an SQL database with a KV API on top. We are a KV database. If you have an SQL application with occasional need for fast KV access, MySQL is your best option. If you need pure KV access with unlimited scalability, then NoSQL DB is your best option.

———
David Segleau is the Director of Product Management for the Oracle NoSQL Database, Oracle Berkeley DB and Oracle Database Mobile Server. He joined Oracle as the VP of Engineering for Sleepycat Software (makers of Berkeley DB). He has more than 30 years of industry experience, leading and managing technical product teams and working extensively with database technology as both a customer and a vendor.

Related Posts

(1) On Eventual Consistency– An interview with Justin Sheehy. August 15, 2012.

(2) MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013.

Resources

Charles Lamb’s Blog

ODBMS.org: Resources on NoSQL Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis.

Follow ODBMS.org on Twitter: @odbmsorg
##

On PostgreSQL. Interview with Tom Kincaid. http://www.odbms.org/blog/2013/05/on-postgresql-interview-with-tom-kincaid/ http://www.odbms.org/blog/2013/05/on-postgresql-interview-with-tom-kincaid/#comments Thu, 30 May 2013 10:05:20 +0000 http://www.odbms.org/blog/?p=2351

“Application designers need to start by thinking about what level of data integrity they need, rather than what they want, and then design their technology stack around that reality. Everyone would like a database that guarantees perfect availability, perfect consistency, instantaneous response times, and infinite throughput, but it's not possible to create a product with all of those properties” –Tom Kincaid.

What is new with PostgreSQL? I have interviewed Tom Kincaid, head of Products and Engineering at EnterpriseDB.

RVZ

(Tom prepared the following responses with contributions from the EnterpriseDB development team)

Q1. EnterpriseDB products are based upon PostgreSQL. What is special about your product offering?

Tom Kincaid: EnterpriseDB has integrated many enterprise features and performance enhancements into the core PostgreSQL code to create a database with the lowest possible TCO and provide the “last mile” of service needed by enterprise database users.

EnterpriseDB's Postgres Plus software provides the performance, security and Oracle compatibility needed to address a range of enterprise business applications. EnterpriseDB's Oracle compatibility, also integrated into the PostgreSQL code base, allows many Oracle shops to realize a much lower database TCO while utilizing their Oracle skills and applications designed to work against Oracle databases.

EnterpriseDB also creates enterprise-grade tools around PostgreSQL and Postgres Plus Advanced Server for use in large-scale deployments. They are Postgres Enterprise Manager, a powerful management console for managing, monitoring and tuning databases en masse whether they're the PostgreSQL community version or EnterpriseDB's enhanced Postgres Plus Advanced Server; xDB Replication Server with multi-master replication and replication between Postgres, Oracle and SQL Server databases; and SQL/Protect for guarding against SQL Injection attacks.

Q2. How does PostgreSQL compare with MariaDB and MySQL 5.6?

Tom Kincaid: There are several areas of difference. PostgreSQL has traditionally had a stronger focus on data integrity and compliance with the SQL standard.
MySQL has traditionally been focused on raw performance for simple queries, and a typical benchmark is the number of read queries per second that the database engine can carry out, while PostgreSQL tends to focus more on having a sophisticated query optimizer that can efficiently handle more complex queries, sometimes at the expense of speed on simpler queries. And, for a long time, MySQL had a big lead over PostgreSQL in the area of replication technologies, which discouraged many users from choosing PostgreSQL.

Over time, these differences have diminished. PostgreSQL's replication options have expanded dramatically in the last three releases, and its performance on simple queries has greatly improved in the most recent release (9.2). On the other hand, MySQL and MariaDB have both done significant recent work on their query optimizers. So each product is learning from the strengths of the other.

Of course, there's one other big difference, which is that PostgreSQL is an independent open source project that is not, and cannot be, controlled by any single company, while MySQL is now owned and controlled by Oracle.
MariaDB is primarily developed by the Monty Program and shows signs of growing community support, but it does not yet have the kind of independent community that PostgreSQL has long enjoyed.

Q3. Tomas Ulin mentioned in an interview that “with MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements”. What is your take on this?

Tom Kincaid: I think anyone who is developing an RDBMS today has to be aware that there are some users who are looking for the features of a key-value store or document database.
On the other hand, many NoSQL vendors are looking to add the sorts of features that have traditionally been associated with an enterprise-grade RDBMS. So I think that theme of convergence is going to come up over and over again in different contexts.
That's why, for example, PostgreSQL added a native JSON datatype as part of the 9.2 release, which is being further enhanced for the forthcoming 9.3 release.
Will we see a RESTful or memcached-like interface to PostgreSQL in the future? Perhaps.
Right now our customers are much more focused on improving and expanding the traditional RDBMS functionality, so that's where our focus is as well.
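
As a small illustration of the native JSON type, here is a hedged psycopg2 sketch; the connection parameters and table name are placeholders, and since the JSON extraction operators arrived later (in 9.3), the example only stores the column and reads it back:

    # Sketch: storing and reading a JSON document in PostgreSQL 9.2+ via psycopg2.
    # Connection settings and the table name are placeholders.
    import json
    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=example user=postgres")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS events (id serial PRIMARY KEY, doc json)")
    cur.execute("INSERT INTO events (doc) VALUES (%s)",
                [Json({"user": "ada", "action": "login", "count": 3})])
    conn.commit()

    cur.execute("SELECT doc::text FROM events ORDER BY id DESC LIMIT 1")
    doc = json.loads(cur.fetchone()[0])            # parse the stored JSON back into a dict
    print(doc["user"], doc["count"])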

Q4. How would you compare your product offering with respect to NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak, and NewSQL such as NuoDB and VoltDB?

Tom Kincaid: It is a matter of the right tools for the right problem. Many of our customers use our products together with the NoSQL solutions you mention. If you need ACID transaction properties for your data, with savepoints and rollback capabilities, along with the ability to access data in a standardized way and a large third party tool set for doing it, a time tested relational database is the answer.
The SQL standard provides the benefit of always being able to switch products and having a host of tools for reporting and administration. PostgreSQL, like Linux, provides the benefit of being able to switch service partners.

If your use case does not mandate the benefits mentioned above and you have data sets in the Petabyte range and require the ability to ingest Terabytes of data every 3-4 hours, a NoSQL solution is likely the right answer. As I said earlier many of our customers use our database products together with NoSQL solutions quite successfully. We expect to be working with many of the NoSQL vendors in the coming year to offer a more integrated solution to our joint customers.

Since it is still pretty new, I haven't had a chance to evaluate NuoDB, so I can't comment on how it compares with PostgreSQL or Postgres Plus Advanced Server.

As far as VoltDB is concerned there is a blog by Dave Page, our Chief Architect for tools and installers, that describes the differences between PostgreSQL and VoltDB. It can be found here.

There is also some terrific insight, on this topic, in an article by my colleague Bruce Momjian, who is one of the most active contributors to PostgreSQL, that can be found here.

Q5. Justin Sheehy of Basho in an interview said “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”. What is your opinion on this?

Tom Kincaid: It's overly simplistic. There is certainly room for asynchronous multi-master replication in applications such as banking, but it has to be done very, very carefully to avoid losing track of the money.
It's not clear that the NoSQL products which provide eventual consistency today make the right trade-offs or provide enough control for serious enterprise applications – or that the products overall are sufficiently stable. Relational databases remain the most mature, time-tested, and stable solution for storing enterprise data.
NoSQL may be appealing for Internet-focused applications that must accommodate truly staggering volumes of requests, but we anticipate that the RDBMS will remain the technology of choice for most of the mission-critical applications it has served so well over the last 40 years.

Q6. What are the suggested criteria for users when they need to choose between durability for lower latency, higher throughput and write availability?

Tom Kincaid: Application designers need to start by thinking about what level of data integrity they need, rather than what they want, and then design their technology stack around that reality.
Everyone would like a database that guarantees perfect availability, perfect consistency, instantaneous response times, and infinite throughput, but it's not possible to create a product with all of those properties.

If you have an application that has a large write throughput and you assume that you can store all of that data using a single database server, which has to scale vertically to meet the load, you're going to be unhappy eventually. With a traditional RDBMS, you're going to be unhappy when you can't scale far enough vertically. With a distributed key-value store, you can avoid that problem, but then you have all the challenges of maintaining a distributed system, which can sometimes involve correlated failures, and it may also turn out that your application makes assumptions about data consistency that are difficult to guarantee in a distributed environment.

By making your assumptions explicit at the beginning of the project, you can consider alternative designs that might meet your needs better, such as incorporating mechanisms for dealing with data consistency issues, or even building application-level sharding into the application itself.

Q7. How do you handle Large Objects Support?

Tom Kincaid: PostgreSQL supports storing objects up to 1GB in size in an ordinary database column.
For larger objects, there's a separate large object API. In current releases, those objects are limited to just 2GB, but the next release of PostgreSQL (9.3) will increase that limit to 4TB. We don't necessarily recommend storing objects that large in the database, though; in many cases, it's more efficient to store enormous objects on a file server rather than as database objects. But the capabilities are there for those who need them.
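
A hedged sketch of both approaches using psycopg2: an ordinary bytea column for in-table storage, and the separate large object facility for bigger payloads (connection details and the table name are placeholders):

    # Sketch: PostgreSQL large object API vs. an ordinary column, via psycopg2.
    import psycopg2

    conn = psycopg2.connect("dbname=example user=postgres")
    cur = conn.cursor()

    # Option 1: store the payload in a normal bytea column (fine up to ~1GB per value).
    cur.execute("CREATE TABLE IF NOT EXISTS files (name text, body bytea)")
    cur.execute("INSERT INTO files VALUES (%s, %s)",
                ("small.bin", psycopg2.Binary(b"\x00" * 1024)))

    # Option 2: the separate large object facility for bigger payloads.
    lob = conn.lobject(0, "wb")                 # oid 0 asks for a new large object
    lob.write(b"\x00" * (8 * 1024 * 1024))      # stream data in (8 MB here)
    oid = lob.oid
    lob.close()
    conn.commit()

    lob = conn.lobject(oid, "rb")               # read it back later by oid
    print(len(lob.read()))
    lob.close()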

Q8. Do you use Data Analytics at EnterpriseDB and for what?

Tom Kincaid: Most companies today use some form of data analytics to understand their customers and their marketplace, and we're no exception. However, how we use data is rapidly changing given our rapid growth and deepening penetration into key markets.

Q9. Do you have customers who have Big Data problem? Could you please give us some examples of Big Data Use Cases?

Tom Kincaid: We have found that most customers with big data problems are using specialized appliances and in fact we partnered with Netezza to assist in creating such an appliance – The Netezza TwinFin Data Warehousing appliance.
See here.

Q10. How do you handle the Big Data Analytics “process” challenges with deriving insight?

Tom Kincaid: EnterpriseDB does not specialize in solutions for the Big Data market and will refer prospects to specialists like Netezza.

Q11. Do you handle un-structured data? If yes, how?

Tom Kincaid: PostgreSQL has an integrated full-text search capability that can be used for document processing, and there are also XML and JSON data types that can be used for data of those types. We also have a PostgreSQL-specific data type called hstore that can be used to store groups of key-value pairs.

Q12. Do you use Hadoop? If yes, what is your experience with Hadoop so far?

Tom Kincaid: We developed, and released in late 2011, our Postgres Plus Connector for Hadoop, which allows massive amounts of data from a Postgres Plus Advanced Server (PPAS) or PostgreSQL database to be accessed, processed and analyzed in a Hadoop cluster. The Postgres Plus Connector for Hadoop allows programmers to process large amounts of SQL-based data using their familiar MapReduce constructs. Hadoop combined with PPAS or PostgreSQL enables users to perform real time queries with Postgres and non-real time CPU intensive analysis and with our connector, users can load SQL data to Hadoop, process it and even push the results back to Postgres.

Q13 Cloud computing and open source: How does it relate to PostgreSQL?

Tom Kincaid: In 2012, EnterpriseDB released its Postgres Plus Cloud Database. We're seeing a wide-scale migration to cloud computing across the enterprise. With that growth has come greater clarity in what developers need in a cloudified database. The solutions are expected to deliver lower costs and management ease with even greater functionality because they are taking advantage of the cloud.

______________________
Tom Kincaid. As head of Products and Engineering, Tom leads the company's product development and directs the company's world-class software engineers. Tom has nearly 25 years of experience in the Enterprise Software Industry.
Prior to EnterpriseDB, he was VP of software development for Oracle’s GlassFish and Web Tier products.
He integrated Sun’s Application Server Product line into Oracle’s Fusion middleware offerings. At Sun Microsystems, he was part of the original Java EE architecture and management teams and played a critical role in defining and delivering the Java Platform.
Tom is a veteran of the Object Database industry and helped build Object Design’s customer service department holding management and senior technical contributor roles. Other positions in Tom’s past include Director of Quality Engineering at Red Hat and Director of Software Engineering at Unica.

Related Posts

MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013

On Eventual Consistency– Interview with Monty Widenius. October 23, 2012

Resources

ODBMS.org: Relational Databases, NewSQL, XML Databases, RDF Data Stores
Blog Posts |Free Software | Articles and Presentations| Lecture Notes | Tutorials| Journals |

Follow ODBMS.org on Twitter: @odbmsorg

##

On Real Time NoSQL. Interview with Brian Bulkowski. http://www.odbms.org/blog/2013/05/on-real-time-nosql-interview-with-brian-bulkowski/ http://www.odbms.org/blog/2013/05/on-real-time-nosql-interview-with-brian-bulkowski/#comments Tue, 21 May 2013 07:19:24 +0000 http://www.odbms.org/blog/?p=2316

“The benefits of hybrid storage are lower cost, and more varied price/performance options. We are seeing SSD pricing range down to $3/GB for enterprise drives, and less than $1/GB for consumer drives, and we also support using standard rotational drives for persistence. Our goal is in-memory performance with the price of ‘storage’ – which makes multi-terabyte analytics caches possible” –Brian Bulkowski.

I have interviewed Brian Bulkowski, Founder & CTO of Aerospike.

RVZ

Q1. What is Aerospike?
Brian Bulkowski: Aerospike is a high speed and highly available NoSQL database. We have proven levels of performance and reliability on commodity hardware that go far beyond any other database – and have the years of production experience to back up our claims.

Q2. How does it compare with other NoSQL or NewSQL databases?
Brian Bulkowski: Aerospike has specialized in areas where downtime is not an option – and at the highest levels of performance. Another area of specialization is applications where low latencies dramatically simplify application development.

As a row-based store with very high levels of read and write concurrency, we perform well where behavior and interactions are concerned – whether those interactions are user-oriented, such as advertising and gaming; or machine to machine, such as capital markets or the emerging Internet of Things.

Q3. What are the typical applications for which Aerospike is currently in use?
Brian Bulkowski: A number of our customers use Aerospike for real-time audience and behavior analysis. A common requirement is to hold profile information for several billion cookies, with a rich schema-free ability to add new data and types. In this use case, a cost-effective multi-terabyte store changes what a business can achieve.

Other use cases include massive multi-player gaming, where a shared game state (at scale) is key to a great user experience, and retaining peak capacity is critical.

Q4. Could you give us some detail on Aerospike`s hybrid memory architecture?
Brian Bulkowski: One of the greatest challenges with databases is keeping indexes in sync with data, and inserting and deleting index entries while under read load. In this area, in-memory techniques shine. An in-memory index can handle a high level of parallelism, and support simultaneous read and write loads. Indexes can be made very small – Aerospike has optimized index entries to one CPU cache line.
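To make the cache-line point concrete, here is a minimal sketch in Python that packs a hypothetical index entry into exactly 64 bytes, the size of one CPU cache line on most x86 servers. The field layout (20-byte key digest, storage offset, record size, generation, TTL) is an illustrative assumption, not Aerospike's actual entry format.

```python
import struct

# Hypothetical 64-byte index entry: a 20-byte key digest, an 8-byte storage
# offset, and 4-byte size, generation and TTL fields, padded so the whole
# entry occupies exactly one 64-byte cache line.
ENTRY_FORMAT = "<20sQIII24x"          # 20 + 8 + 4 + 4 + 4 + 24 padding = 64
assert struct.calcsize(ENTRY_FORMAT) == 64

def pack_entry(digest: bytes, offset: int, size: int,
               generation: int, ttl: int) -> bytes:
    """Pack one index entry so a single cache-line fetch covers all of it."""
    return struct.pack(ENTRY_FORMAT, digest, offset, size, generation, ttl)

entry = pack_entry(b"\x00" * 20, offset=4096, size=1024, generation=1, ttl=0)
assert len(entry) == 64
```

Keeping each entry to one cache line means an index lookup touches a single line of memory, which is what lets an in-memory index sustain high read and write concurrency.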

Aerospike also allows a process to restart and retain its index by using the shared-memory facility of Linux. An index needs to be rebuilt only when a machine itself restarts, and in those cases it can be best to simply use the promoted replica during the short time the index is being re-created.
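As an illustration of that restart technique, the sketch below keeps an index arena in a named shared-memory segment so a restarted process can re-attach to it instead of rebuilding the index. The segment name and size are assumptions for the sketch; this is not Aerospike's implementation.

```python
from multiprocessing import shared_memory

INDEX_SHM_NAME = "index_arena_demo"    # hypothetical segment name
INDEX_SHM_SIZE = 64 * 1024 * 1024      # 64 MB arena for this sketch

def attach_index_arena() -> shared_memory.SharedMemory:
    """Attach to an existing index arena, or create one on first start.

    Because the arena lives in shared memory rather than the process heap,
    a restarted daemon can re-attach and keep serving; only a machine
    reboot forces the index to be rebuilt."""
    try:
        return shared_memory.SharedMemory(name=INDEX_SHM_NAME)
    except FileNotFoundError:
        return shared_memory.SharedMemory(name=INDEX_SHM_NAME, create=True,
                                          size=INDEX_SHM_SIZE)

arena = attach_index_arena()
# ... index reads and writes against arena.buf would go here ...
arena.close()   # detach without destroying the segment

# Caveat: CPython's resource tracker may unlink the segment when the creating
# process exits, so a real system would use raw POSIX or System V shared
# memory to get true restart persistence.
```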

Q5. What is the main advantage of using a hybrid memory architecture?
Brian Bulkowski: The benefits of hybrid storage are lower cost, and more varied price/performance options. We are seeing SSD pricing range down to $3/GB for enterprise drives, and less than $1/GB for consumer drives, and we also support using standard rotational drives for persistence. Our goal is in-memory performance with the price of ‘storage’ – which makes multi-terabyte analytics caches possible.

Q6. How do you ensure scale up and scale out?
Brian Bulkowski: Scale-up comes from short individual code paths and is verified with single-server performance benchmarks. Our architecture focuses on exploiting multiple cores, deployment configurations with appropriate interrupt-routing rules, and classic database techniques like asynchronous I/O.

Scale-out is ensured through a number of techniques: provably random storage distribution, a minimal and deterministic number of nodes participating in each transaction, and client techniques that limit the effect of a single slow node on the other nodes.
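A minimal sketch of that distribution idea, under assumptions of a fixed 4,096-partition map, a SHA-1 digest, and round-robin partition ownership (all illustrative, not Aerospike's exact scheme): every key hashes to the same partition on every client, and the partition table names exactly which master and replica take part in the transaction.

```python
import hashlib

N_PARTITIONS = 4096                                  # illustrative fixed count
NODES = ["node-a", "node-b", "node-c", "node-d"]     # hypothetical cluster

def key_digest(namespace: str, key: str) -> bytes:
    """Hash the key to a fixed-size digest; the hash spreads keys uniformly."""
    return hashlib.sha1(f"{namespace}:{key}".encode()).digest()

def partition_id(digest: bytes) -> int:
    """Deterministically map a digest to one of the fixed partitions."""
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS

def partition_table(nodes, replication_factor=2):
    """Build a partition -> [master, replica, ...] map. Round-robin ownership
    keeps the sketch readable; a real cluster derives it from node IDs."""
    return {pid: [nodes[(pid + i) % len(nodes)] for i in range(replication_factor)]
            for pid in range(N_PARTITIONS)}

table = partition_table(NODES)
pid = partition_id(key_digest("users", "cookie:12345"))
print(pid, table[pid])    # a deterministic, minimal set of nodes for this key
```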

Q7. Do you have any performance measures to share with us?
Brian Bulkowski: Please see this performance benchmark report from Thumbtack Technology, demonstrating how Aerospike is 10 times faster than MongoDB or Cassandra.

Please also see Aerospike Principal Architect Russ Sullivan’s High Scalability blog post on achieving 1M TPS on commodity hardware using Aerospike. This use case involves more system tuning than the Thumbtack tests.

Q8. How do you handle real-time big data analytics?
Brian Bulkowski: Many customers use Aerospike as a persistent part of their real-time analytics processing chain. Their application logic needs to know counts of recent visits, recent searches, preferences, and preferences of friends, and also to store batch analytics results. For example, you might calculate that a certain user is part of an audience, then keep up-to-date per-audience counters.

By placing these calculations in the hands of the developer, at extremely low latency, we find developers and architects are able to scale increasingly powerful real-time computations.
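A sketch of the per-audience counter pattern described above; the key-value class here is a stand-in for any low-latency store with an atomic increment, and the key names are hypothetical rather than a specific Aerospike API.

```python
from collections import defaultdict

class InMemoryKV:
    """Stand-in for a low-latency key-value store with atomic increments."""
    def __init__(self):
        self._counters = defaultdict(int)

    def increment(self, key: str, delta: int = 1) -> int:
        self._counters[key] += delta
        return self._counters[key]

kv = InMemoryKV()

def record_visit(user_id: str, audiences: list) -> None:
    """Batch analytics decided which audiences the user belongs to; on every
    visit, keep the per-audience and per-user counters current."""
    for segment in audiences:
        kv.increment(f"audience:{segment}:visits")
    kv.increment(f"user:{user_id}:recent_visits")

record_visit("cookie:12345", ["sports_fans", "frequent_buyers"])
```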

Q9. You claim to provide 17x better TCO than other NoSQL databases. Could you please provide some evidence for this claim?
Brian Bulkowski: A great benefit of Aerospike is the ability to store data in Flash and indexes in DRAM. This plays to the strengths of both: DRAM accelerates index lookups and locating the data, so only one device read (or write) is then required to fetch or store the data itself.

Most NoSQL databases require all of the data in memory in order to achieve high performance. A more detailed analysis is outlined in David Floyer’s December 12, 2012 Wikibon article.
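A back-of-the-envelope illustration of the DRAM-versus-Flash cost argument, using the $3/GB enterprise-SSD figure quoted earlier plus two assumptions of mine: roughly $10/GB for 2013-era server DRAM and a 64-byte DRAM index entry per record. It compares raw storage media only; the 17x TCO claim also factors in server count, power, and operations, which this sketch does not model.

```python
# Hypothetical 4 TB dataset of 4 billion 1 KB records, replication ignored.
DATA_GB = 4 * 1024
RECORDS = 4_000_000_000
INDEX_ENTRY_BYTES = 64        # one cache-line entry per record (see Q4)

DRAM_PER_GB = 10.0            # assumed 2013-era server DRAM price, USD
SSD_PER_GB = 3.0              # enterprise SSD figure quoted in the interview

all_in_dram = DATA_GB * DRAM_PER_GB
index_gb = RECORDS * INDEX_ENTRY_BYTES / 1024**3
hybrid = index_gb * DRAM_PER_GB + DATA_GB * SSD_PER_GB

print(f"all data in DRAM: ${all_in_dram:,.0f}")
print(f"hybrid (index in DRAM, data on SSD): ${hybrid:,.0f}")
print(f"media-cost ratio: {all_in_dram / hybrid:.1f}x")
```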

Q10. How do you manage replication in Aerospike?
Brian Bulkowski: We commit each write to memory and reserve space for the data to be written to disk. Writes are aggregated into large blocks to produce the streaming writes that are optimal for both SSDs and standard rotational disks; each block passes through a short disk queue and often commits in less than a few milliseconds.
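The sketch below illustrates the aggregation step: individual writes land in an in-memory buffer and are flushed to the device as one large sequential write once the buffer fills. The 1 MB block size and file path are assumptions for the sketch, not Aerospike's actual defaults.

```python
import os

class WriteBlockBuffer:
    """Aggregate many small writes into large blocks for streaming I/O."""

    def __init__(self, path: str, block_size: int = 1024 * 1024):
        self.block_size = block_size    # e.g. 1 MB blocks (illustrative)
        self.buffer = bytearray()
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def write(self, record: bytes) -> None:
        """Commit to memory immediately; the device sees one big write later."""
        self.buffer += record
        if len(self.buffer) >= self.block_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            os.write(self.fd, bytes(self.buffer))   # one streaming write
            self.buffer.clear()

buf = WriteBlockBuffer("/tmp/aggregated_writes.dat")
for i in range(10_000):
    buf.write(f"record-{i}\n".encode())
buf.flush()
```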

Each row has a current master server within the cluster. This mapping changes as servers are added and removed, and one or more replicas are chosen randomly from the remaining nodes.

Q11. What about data consistency when updating the data?
Brian Bulkowski: By applying the change to all relevant machines in the cluster, we achieve the important application requirement of a write taking effect immediately. By applying the write to multiple machines in memory, we achieve very high reliability, because multiple machines would have to fail within the same millisecond for an uncommitted transaction to be lost. When the application reads, it will always see the updated data by default, with no performance impact.
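A sketch of that write-visibility idea, assuming a hypothetical coordinator that applies each write to the master and replica copies in memory before acknowledging the client; the class and function names are mine, not Aerospike's.

```python
class Node:
    """Hypothetical cluster node holding an in-memory copy of its rows."""
    def __init__(self, name: str):
        self.name = name
        self.store = {}

    def apply(self, key: str, value: bytes) -> None:
        self.store[key] = value

def replicated_write(key: str, value: bytes, master: Node, replicas: list) -> bool:
    """Apply the write to every relevant node in memory before acknowledging.
    All copies would have to be lost in the same instant for the write to
    disappear, and a read served by any of these nodes sees the new value."""
    for node in [master] + replicas:
        node.apply(key, value)
    return True    # acknowledge only after all in-memory copies are updated

a, b = Node("master"), Node("replica")
replicated_write("user:42", b"profile-v2", master=a, replicas=[b])
assert a.store["user:42"] == b.store["user:42"] == b"profile-v2"
```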

Q12. Could you please explain how cross data center synchronization works for Aerospike?
Brian Bulkowski: Our cross data center synchronization is a no-single-point-of-failure, asynchronous replication mechanism. Asynchronous replication is important because a network outage or bandwidth problem can impede long-haul links while a data center continues to serve requests from its local cluster. The system can be operated in a ‘star topology’, where a single master cluster is replicated to other, more local data centers, or in a ‘ring topology’, where every cluster takes writes and forwards them around the ring. Conflicts can happen in a ring deployment if a row is modified in different locations at the same time.
This requires a conflict resolution system, such as Aerospike’s versions feature, which retains conflicting copies for later resolution, or a heuristic that keeps the row with the most writes and the later timestamp.
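A minimal sketch of the heuristic path, assuming each replicated copy carries a write count (generation) and a last-update timestamp; the field names are hypothetical. A versions-based policy would instead keep both copies for the application to merge later.

```python
from dataclasses import dataclass

@dataclass
class RowVersion:
    data: bytes
    generation: int       # how many times this copy has been written
    last_update_ms: int   # wall-clock timestamp of the last write

def resolve_conflict(a: RowVersion, b: RowVersion) -> RowVersion:
    """Heuristic from the interview: keep the copy with the most writes,
    breaking ties with the later timestamp."""
    if a.generation != b.generation:
        return a if a.generation > b.generation else b
    return a if a.last_update_ms >= b.last_update_ms else b

winner = resolve_conflict(
    RowVersion(b"dc1-value", generation=5, last_update_ms=1_368_000_000_000),
    RowVersion(b"dc2-value", generation=5, last_update_ms=1_368_000_000_500),
)
print(winner.data)    # b'dc2-value' wins on the later timestamp
```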

Q13. What is the contribution of Don Haderle to Aerospike?
Brian Bulkowski: Don first exposed us to DB2‘s historical strategy of building very fast and reliable single-row operations, then adding higher level primitives. Primary key database operations are similar to storage operations in a file store, and have extraordinary business value. By deploying with many paying customers, we’ve proven our reliable and fast primary key layer – unlike other new, clustered databases which are still seeking large production proof.

———————
Brian Bulkowski, Co-Founder & CTO Aerospike
Brian has 20+ years of experience designing, developing and tuning networking systems and high-performance, high-scale infrastructures. He founded Aerospike after learning first-hand the scaling limitations of sharded MySQL systems at Aggregate Knowledge. As Director of Performance at this media intelligence SaaS company, Brian led the team that built and operated a clustered recommendation engine. Prior to Aggregate Knowledge, Brian was a founding member of the digital TV team at Navio Communications and Chief Architect of Cable Solutions at Liberate Technologies, where he built the high-performance embedded networking stack and the internet-scale broadcast server infrastructure. Before Liberate, Brian was a Lead Engineer at Novell, where he was responsible for the AppleTalk stack for NetWare 3 and 4. Brian holds a B.S. in Mathematics/Computer Science from Brown University.

Resources

NoSQL Failover Characteristics: Aerospike, Cassandra, Couchbase, and MongoDB.
Ben Engber, CEO, Thumbtack Technology
Abstract:
Thumbtack’s latest whitepaper focuses on how NoSQL databases perform during hardware failures. This paper follows our previous study of NoSQL performance and explores how different consistency and durability models affect hardware planning and application design.
Paper | Advanced | English | DOWNLOAD (PDF) | March 2013

Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs.
Ben Engber, CEO, Thumbtack Technology, JANUARY 2013
Abstract:
One question that comes up a lot in our discussions with customers is “Tell me which NoSQL database is best.” Of course, the answer depends on the use cases and the transactional needs of the specific customer. We regularly go through this analysis with customers, but we often find at the outset that the basic questions of how much consistency is needed, and what tradeoffs are involved, have not been answered. Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs is a benchmark report created to help organizations answer exactly those questions about NoSQL databases. Our goal was to create a benchmark that addressed a specific class of real-world big data problems, unlike other published reports. We focused on how durability and consistency needs affected raw performance.
Paper | Advanced | English | DOWNLOAD (PDF) | 2013

Don Haderle, Father of IBM DB2, Talks about the Information Explosion:
Watch Video.

Related Posts

On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen. May 13, 2013

Follow ODBMS.org on Twitter: @odbmsorg

##
