
"Trends and Information on Big Data, New Data Management Technologies, and Innovation."


Jan 13 14

Big Data: Three questions to InterSystems.

by Roberto V. Zicari

“The absence of a schema has some flexibility advantages, although for querying the data, the absence of a schema presents some challenges to people accustomed to a classic RDBMS.” –Iran Hutchinson.

I start this new year with a new series of short interviews with leading vendors of Big Data technologies. I call them “Big Data: three questions to …”. The first of these interviews is with Iran Hutchinson, Big Data Specialist at InterSystems.

RVZ

Q1. What is your current “Big Data” products offering?

Iran Hutchinson: InterSystems has actually been in the Big Data business for some time, since 1978, long before anyone called it that. We currently offer an integrated platform for database management, integration and analytics, based on InterSystems Caché®, our flagship product, to enable Big Data breakthroughs in a variety of industries.

Launched in 1997, Caché is an advanced object database that provides in-memory speed with persistence, and the ability to ingest huge volumes of transactional data at insanely high velocity. It is massively scalable, because of its very lean design. Its efficient multidimensional data structures require less disk space and provide faster SQL performance than relational databases. Caché also provides sophisticated analytics, enabling real-time queries against transactional data with minimal maintenance and hardware requirements.

InterSystems Ensemble® is our seamless platform for integrating and developing connected applications. Ensemble can be used as a central processing hub or even as the backbone for nationwide networks. By integrating this connectivity with our high-performance Caché database, as well as with new technologies for analytics, high availability, security, and mobile solutions, we can deliver a rock-solid, unified Big Data platform, not a patchwork of disparate solutions.

We also offer additional technologies built on our integrated platform, such as InterSystems HealthShare®, a health informatics platform that enables strategic interoperability and analytics for action. Our TrakCare unified health information system is likewise built upon this same integrated framework.

Q2. Who are your current customers and how do they typically use your products?

Iran Hutchinson: We continually update our technology to enable customers to better manage, ingest and analyze Big Data. Our clients are in healthcare, financial services, aerospace, utilities – industries that have extremely demanding requirements for performance and speed. For example, Caché is the world’s most widely used database in healthcare. Entire countries, such as Sweden and Scotland, run their national health systems on Caché, as do top hospitals and health systems around the world. One client alone runs 15 percent of the world’s equity trades through InterSystems software, and all of the top 10 banks use our products.

It is also being used by the European Space Agency to map a billion stars – the largest data processing task in astronomy to date. (See The Gaia Mission One Year Later.)

Our configurable ACID (Atomicity, Consistency, Isolation, Durability) capabilities and our ECP (Enterprise Cache Protocol)-based approach enable us to handle these kinds of very large-scale, very high-performance, transactional Big Data applications.

Q3. What are the main new technical features you are currently working on and why?

Iran Hutchinson: There are several new paradigms we are working on, but let’s focus on analytics. Once you absorb all that Big Data, you want to run analytics. And that’s where the three V’s of Big Data – volume, velocity and variety – are critically important.

Let’s talk about the variety of data. Most popular Big Data analytics solutions start with the assumption of structured data – rows and columns – when the most interesting data is unstructured, or text-based data. A lot of our competitors still struggle with unstructured data, but we solved this problem with Caché in 1997, and we keep getting better at it. InterSystems Caché offers both vertical and horizontal scaling, enabling schema-less and schema-based (SQL) querying options for both structured and unstructured data.
As a result, our clients today are running analytics on all their data – and we mean real-time, operational data, not the data that is aggregated a week later or a month later for boardroom presentations.

A lot of development has been done in the area of schema-less data stores or so-called document stores, which are mainly key-value stores. The absence of a schema has some flexibility advantages, although for querying the data, the absence of a schema presents some challenges to people accustomed to a classic RDBMS. Some companies now offer SQL querying on schema-less data stores as an add-on or plugin. InterSystems Caché provides a high-performance key-value store with native SQL support.
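
To make the idea concrete, here is a minimal, generic sketch of SQL querying over schema-less documents held in a key-value table. It is not InterSystems’ implementation; it uses Python with SQLite’s JSON1 functions (available in recent SQLite builds), and the documents and field names are hypothetical.

```python
# Generic illustration (not InterSystems' implementation): schema-less JSON
# documents stored as key-value pairs, queried with plain SQL via SQLite's
# JSON1 functions.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (key TEXT PRIMARY KEY, value TEXT)")  # value holds raw JSON

docs = {
    "p1": {"name": "Alice", "specialty": "cardiology", "visits": 12},
    "p2": {"name": "Bob", "visits": 3},  # no 'specialty' attribute: no schema is enforced
    "p3": {"name": "Carol", "specialty": "oncology", "visits": 7},
}
conn.executemany(
    "INSERT INTO docs (key, value) VALUES (?, ?)",
    [(k, json.dumps(v)) for k, v in docs.items()],
)

# SQL over schema-less data: json_extract pulls attributes out of each document.
rows = conn.execute(
    """
    SELECT key,
           json_extract(value, '$.name')   AS name,
           json_extract(value, '$.visits') AS visits
    FROM docs
    WHERE json_extract(value, '$.visits') > 5
    ORDER BY visits DESC
    """
).fetchall()
print(rows)  # [('p1', 'Alice', 12), ('p3', 'Carol', 7)]
```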

The commonly available SQL-based solutions also require a predefinition of what the user is interested in. But if you don’t know the data, how do you know what’s interesting? Embedded within Caché is a unique and powerful text analysis technology, called iKnow, that analyzes unstructured data out of the box, without requiring any predefinition through ontologies or dictionaries. Whether it’s English, German, or French, iKnow can automatically identify concepts and understand their significance – and do that in real-time, at transaction speeds.

iKnow enables not only lightning-fast analysis of unstructured data, but also equally efficient Google-like keyword searching via SQL with a technology called iFind.
And because we married that iKnow technology with another real-time OLAP-type technology we call DeepSee, we make it possible to embed this analytic capability into your applications. You can extract complex concepts and build cubes on both structured AND unstructured data. We blend keyword search and concept discovery, so you can express a SQL query and pull out both concepts and keywords on unstructured data.

Much of our current development activity is focused on enhancing our iKnow technology for a more distributed environment.
This will allow people to upload a data set, structured and/or unstructured, and organize it in a flexible and dynamic way by just stepping through a brief series of graphical representations of the most relevant content in the data set. By selecting, in the graphs, the elements you want to use, you can immediately jump into the micro-context of these elements and their related structured and unstructured information objects. Alternatively, you can further segment your data into subsets that fit the use you had in mind. In this second case, the set can be optimized by a number of classic NLP strategies such as similarity extension, typicality pattern parallelism, etc. The data can also be wrapped into existing cubes or into new ones, or fed into advanced predictive models.

So our goal is to offer our customers a stable solution that really uses both structured and unstructured data in a distributed and scalable way. We will demonstrate the results of our efforts in a live system at our next annual customer conference, Global Summit 2014.

We also have a software partner that has built a very exciting social media application, using our analytics technology. It’s called Social Knowledge, and it lets you monitor what people are saying on Twitter and Facebook – in real-time. Mind you, this is not keyword search, but concept analysis – a very big difference. So you can see if there’s a groundswell of consumer feedback on your new product, or your latest advertising campaign. Social Knowledge can give you that live feedback – so you can act on it right away.

In summary, today InterSystems provides SQL and DeepSee over our shared data architecture to do structured data analysis.
And for unstructured data, we offer iKnow semantic analysis technology and iFind, our iKnow-powered search mechanism, to enable information discovery in text. These features will be enabled for text analytics in future versions of our shared-nothing data architectures.

Related Posts

- The Gaia mission, one year later. Interview with William O’Mullane.
ODBMS Industry Watch, January 16, 2013

- Operational Database Management Systems. Interview with Nick Heudecker. ODBMS Industry Watch, December 16, 2013.

- Challenges and Opportunities for Big Data. Interview with Mike Hoskins. ODBMS Industry Watch, December 3, 2013.

- On Analyzing Unstructured Data. — Interview with Michael Brands.
ODBMS Industry Watch, July 11, 2012.

Resources

ODBMS.org: Big Data Analytics, NewSQL, NoSQL, Object Database Vendors –Free Resources.

ODBMS.org: Big Data and Analytical Data Platforms, NewSQL, NoSQL, Object Databases– Free Downloads and Links.

ODBMS.org: Expert Articles.

Follow ODBMS.org on Twitter: @odbmsorg

##

Dec 16 13

Operational Database Management Systems. Interview with Nick Heudecker

by Roberto V. Zicari

“Going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time.”–Nick Heudecker.

Gartner recently published a new report on “Operational Database Management Systems”. I have interviewed one of the co-authors of the report, Nick Heudecker, Research Director – Information Management at Gartner, Inc.

Happy Holidays and Happy New Year to you and yours!


RVZ

Q1. You co-authored Gartner’s new report, “Magic Quadrant for Operational Database Management Systems”. How do you define “Operational Database Management Systems” (ODBMS)?

Nick Heudecker: Prior to operational DBMS, the common label for these databases was OLTP. However, OLTP no longer accurately describes the range of activities an operational DBMS is called on to support. Additionally, mobile and social, elements of Gartner’s Nexus of Forces, have created new activity types which we broadly classify as interactions and observations. Supporting these new activity types has resulted in new vendors entering the market to compete with established vendors. Also, the OLTP label itself is no longer a useful distinction, as all transactions are now online.

Q2. What were the main evaluation criteria you used for the “Magic Quadrant for Operational Database Management Systems” report?

Nick Heudecker: The primary evaluation input for any Magic Quadrant consists of customer reference surveys. Vendors are also evaluated on market understanding, strategy, offerings, business model, execution, and overall viability.

Q3. To be included in the Magic Quadrant, what were the criteria that vendors and products had to meet?

Nick Heudecker: To be included in the Magic Quadrant, vendors had to have at least ten customer references, meet a minimum revenue number and meet our definition
of the market.

Q4. What is new in the last year in the Operational Database Management Systems area, in your view? What is changing?

Nick Heudecker: Innovations in the operational DBMS area have developed around flash memory, DRAM improvements, new processor technology, networking and appliance form factors. Flash memory devices have become faster, larger, more reliable and cheaper. DRAM has become far less costly and grown in size to greater than 1TB available on a server.
This has not only enabled larger disk caching, but also led to the development and wider use of in-memory DBMSs. New processor technology not only enables better DBMS performance in physically smaller servers, but also allows virtualization to be used for multiple applications and the DBMS on the same server. With new methods of interconnect such as 10-gigabit Ethernet and InfiniBand, the connection between the storage systems and the DBMS software on the server is far faster. This has also increased performance and allowed for larger storage in a smaller space and faster interconnect for distributed data in a scale-out architecture. Finally, DBMS appliances are beginning to gain acceptance.

Q5. You also co-authored Gartner’s “Who’s Who in NoSQL Databases” report back in August. What is the current status of the NoSQL market in your opinion?

Nick Heudecker: There is a substantial amount of interest in NoSQL offerings, but also a great deal of confusion related to use cases and how vendor offerings are differentiated.
One question we get frequently is if NoSQL DBMSs are viable candidates to replace RDBMSs. To date, NoSQL deployments have been overwhelmingly supplemental to traditional relational DBMS deployments, not destructive.

Q6. How does the NoSQL market relate to the Operational Database Management Systems market?

Nick Heudecker: First, it’s difficult to define a NoSQL market. There are four distinct categories of NoSQL DBMS (document, key-value, table-style and graph), each with different capabilities and addressable use cases. That said, the various types of NoSQL DBMSs are included in the operational DBMS market based on capabilities around interactions and observations.
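
To illustrate the distinction, the sketch below shows how the same (hypothetical) order record might be shaped in each of the four categories; it is plain Python data and not tied to any particular vendor’s product.

```python
# Illustrative only: the same "order" record shaped for each of the four
# NoSQL categories mentioned above (document, key-value, table-style, graph).
# All names and values are hypothetical.

# 1. Document store: one self-contained, nested JSON-like document.
document = {
    "_id": "order:1001",
    "customer": {"name": "Acme Corp", "country": "DE"},
    "items": [{"sku": "A-17", "qty": 2}, {"sku": "B-42", "qty": 1}],
}

# 2. Key-value store: an opaque value looked up by key; the store itself
#    knows nothing about the value's internal structure.
key_value = {"order:1001": b'{"customer":"Acme Corp","items":["A-17","B-42"]}'}

# 3. Table-style (wide-column): rows grouped by key, columns may vary per row.
table_style = {
    "order:1001": {"customer": "Acme Corp", "item:A-17": 2, "item:B-42": 1},
    "order:1002": {"customer": "Globex", "item:C-07": 5},  # different columns
}

# 4. Graph: entities as nodes, relationships as labeled edges.
graph = {
    "nodes": {"order:1001": {}, "customer:acme": {}, "sku:A-17": {}},
    "edges": [
        ("order:1001", "PLACED_BY", "customer:acme"),
        ("order:1001", "CONTAINS", "sku:A-17"),
    ],
}
```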

Q7. What do you see happening with Operational Database Management Systems, going forward?

Nick Heudecker: Going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time.

——————

Nick Heudecker is a Research Director for Gartner Inc, covering information management topics and specializing in Big Data and NoSQL.
Prior to Gartner, Mr. Heudecker worked with several Bay Area startups and developed an enterprise software development consulting practice. He resides in Silicon Valley.

—————————-
Resources

Gartner, “Magic Quadrant for Operational Database Management Systems,” by Donald Feinberg, Merv Adrian, and Nick Heudecker, October 21, 2013

Access the full Gartner Magic Quadrant report (via MarkLogic web site- registration required).

Access the full Gartner Magic Quadrant report (via Aerospike web site- registration required)

Related Posts

-On Big Data and NoSQL. Interview with Renat Khasanshyn. October 7, 2013

-On NoSQL. Interview with Rick Cattell. August 19, 2013

Follow ODBMS.org on Twitter: @odbmsorg

##

Dec 3 13

Challenges and Opportunities for Big Data. Interview with Mike Hoskins

by Roberto V. Zicari

“We are facing an imminent torrent of machine generated data, creating volumes that will break the back of conventional hardware and software architectures. It is no longer feasible to move the data to the compute process – the compute process has to be moved to the data.” –Mike Hoskins.

On the topic, Challenges and Opportunities for Big Data, I have interviewed Mike Hoskins, Actian Chief Technology Officer.

RVZ

Q1. What are in your opinion the most interesting opportunities in Big Data?

Mike Hoskins: Until recently, most data projects were solely focused on preparation. Seminal developments in the big data landscape, including Hortonworks Data Platform (HDP) 2.0 and the arrival of YARN (Yet Another Resource Negotiator) – which takes Hadoop’s data processing capabilities beyond the limitations of the highly regimented and restrictive MapReduce programming model – provide an opportunity to move beyond the initial hype of big data and toward the higher-value work of predictive analytics.
As more big data applications are built on the Hadoop platform and customized to industry and business needs, we’ll really begin to see organizations leveraging predictive analytics across the enterprise – not just in a sandbox or in the domain of the data scientists, but in the hands of business users. At that point, more immediate action can be taken on insights.

Q2. What are the most interesting challenges in Big Data?

Mike Hoskins: We are facing an imminent torrent of machine generated data, creating volumes that will break the back of conventional hardware and software architectures. It is no longer feasible to move the data to the compute process – the compute process has to be moved to the data. Companies need to rethink their static and rigid business intelligence and analytic software architectures in order to continue working at the speed of business. It’s clear that time has become the new gold standard – you can’t produce more of it; you can only increase the speed at which things happen.
Software vendors with the capacity to survive and thrive in this environment will keep pace with the competition by offering a unified platform, underpinned by engineering innovation, completeness of solution and the service integrity and customer support that is essential to market staying power.

Q3. Steve Shine, CEO and President, Actian Corporation, said in a recent interview (*) that “the synergies in data management come not from how the systems connect but how the data is used to derive business value”. Actian has completed a number of acquisitions this year. So, what is your strategy for Big Data at Actian?

Mike Hoskins: Actian has placed its bets on a completely modern unified platform that is designed to deliver on the opportunities presented by the Age of Data. Our technology assets bring a level of maturity and innovation to the space that no other technology vendor can provide – with 30+ years of expertise in ‘all things data’ and over $1M investment in innovation. Our mission is to arm organizations with solutions that irreversibly shift the price/performance curve beyond the reach of traditional legacy stack players, allowing them to get a leg up on the competition, retain customers, detect fraud, predict business trends and effectively use data as their most important asset.

Q4. What are the products synergies related to such a strategy?

Mike Hoskins: Through the acquisition of Pervasive Software (a provider of big data analytics and cloud-based and on-premises data management and integration), Versant (an industry leader in specialized data management), and ParAccel (a leader in high-performance analytics), Actian has compiled a unified end-to-end platform with capabilities to connect, prep, optimize and analyze data natively on Hadoop, and then offer it to the necessary reporting and analytics environments to meet virtually any business need, all the while operating on commodity hardware at a much lower cost than legacy software can ever reach.

Q5. What else still needs to be done at Actian to fully deploy this strategy?

Mike Hoskins: There are definitely opportunities to continue integrating the platform experience and improve the user experience overall. Our world-class database technology can be brought closer to Hadoop, and we will continue innovating on analytic techniques to grow our stack upward.
Our development team is working diligently to create a common user interface across all of our platforms as we bring our technology together. We have the opportunity to create a true first-class SQL engine running natively on Hadoop, and to more fully exploit market-leading cooperative computing with our On-Demand Integration (ODI) capabilities. I would also like to raise awareness of the power and speed of our offerings as a general paradigm for analytic applications.

We don’t know what new challenges the Age of Data will bring, but we will continue to look to the future and build out a technology infrastructure to help organizations deal with the only constant – change.

Q6. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

Mike Hoskins: Elastic cloud computing is a convulsive game changer in the marketplace. It’s positive; if not where you do full production, at the very least it allows people to test, adopt and experiment with their data in a way that they couldn’t before. For cases where data is born in the cloud, using a 100% cloud model makes sense. However, much data is highly distributed across cloud and on-premises systems and applications, so it’s vital to have technology that can run in and connect to either environment via a hybrid model.

We will soon see more organizations utilizing cloud platforms to run analytic processes, if that is where their data is born and lives.

Q7. How is your Cloud technology helping Amazon’s Redshift?

Mike Hoskins: Amazon Redshift leverages our high-performance analytics database technology to help users get the most out of their cloud investment. Amazon selected our technology over all other database and data warehouse technologies available in the marketplace because of its incredible performance, extreme scalability, and flexibility.

Q8. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey.
When you speak with your customers what are the typical use cases and requirements they have?

Mike Hoskins: A recent survey of data architects and CIOs by Sand Hill Group revealed that the top challenge of Hadoop adoption was knowledge and experience with the Hadoop platform, followed by the availability of Hadoop and big data skills, and finally the amount of technology development required to implement a Hadoop-based solution. This just goes to show how little we have actually begun to fully leverage the capabilities of Hadoop. Businesses are really only just starting to dip their toe in the analytic water. Although it’s still very early, the majority of use cases that we have seen are centered around data prep and ETL.

Q9. What do you think is still needed for big data analytics to be really useful for the enterprise?

Mike Hoskins: If we look at the complete end-to-end data pipeline, there are several things that are still needed for enterprises to take advantage of the opportunities. This includes high productivity, performant integration layers, and analytics that move beyond the sphere of data science and into mainstream business usage, with discovery analytics through a simple UI studio or an analytics-as-a-service offering. Analytics need to be made more available in the critical discovery phase, to bring out the outcomes, patterns, models, discoveries, etc. and begin applying them to business processes.

Qx. Anything else you wish to add?

Mike Hoskins: These kinds of highly disruptive periods are, frankly, unnerving for the marketplace and businesses. Organizations cannot rely on traditional big stack vendors, who are unprepared for the tectonic shift caused by big data, and therefore are not agile enough to rapidly adjust their platforms to deliver on the opportunities. Organizations are forced to embark on new paths and become their own System Integrators (SIs).

On the other hand, organizations cannot tie their future to the vast number of startups, throwing darts to find the one vendor that will prevail. Instead, they need a technology partner somewhere in the middle that understands data in-and-out, and has invested completely and wholly as a dedicated stack to help solve the challenge.

Although it’s uncomfortable, it is urgent that organizations look at modern architectures, next-generation vendors and innovative technology that will allow them to succeed and stay competitive in the Age of Data.

—————————–
Mike Hoskins, Actian Chief Technology Officer
Actian CTO Michael Hoskins directs Actian’s technology innovation strategies and evangelizes accelerating trends in big data, and cloud-based and on-premises data management and integration. Mike, a Distinguished and Centennial Alumnus of Ohio’s Bowling Green State University, is a respected technology thought leader who has been featured in TechCrunch, Forbes.com, Datanami, The Register and Scobleizer. Mike has been a featured speaker at events worldwide, including Strata NY + Hadoop World 2013, the keynoter at DeployCon 2012, the “Open Standards and Cloud Computing” panel at the Annual Conference on Knowledge Discovery and Data Mining, the “Scaling the Database in the Cloud” panel at Structure 2010, and the “Many Faces of Map Reduce – Hadoop and Beyond” panel at Structure Big Data 2011. Mike received the AITP Austin chapter’s 2007 Information Technologist of the Year Award for his leadership in developing Actian DataRush, a highly parallelized framework to leverage multicore. Follow Mike on Twitter: @MikeHSays.

Related Posts

-Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner. November 15, 2013

- On Big Data. Interview with Adam Kocoloski. November 5, 2013

- Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. September 23, 2013

(*) Acquiring Versant –Interview with Steve Shine. March 6, 2013

Resources

- “Do You Hadoop? A Survey of Big Data Practitioners”, Bradley Graham, M. R. Rangaswami, SandHill Group, October 29, 2013 (.PDF)

- Actian Vectorwise 3.0: Fast Analytics and Answers from Hadoop. Actian Corporation.
Paper | Technical | English | DOWNLOAD (PDF) | May 2013

Nov 15 13

Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner

by Roberto V. Zicari

“My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.” —Dr. Jochen L. Leidner.

I wanted to know how Thomson Reuters uses Big Data. I have interviewed Dr. Jochen L. Leidner, Lead Scientist, of the London R&D at Thomson Reuters.

RVZ

Q1. What is your current activity at Thomson Reuters?
Jochen L. Leidner: For the most part, I carry out applied research in information access, and that’s what I have been doing for quite a while. After five years with the company – I joined from the University of Edinburgh, where I had been a postdoctoral Royal Society of Edinburgh Enterprise Fellow half a decade ago – I am currently a Lead Scientist with Thomson Reuters, where I am building up a newly established London site that is part of our Corporate Research & Development group (by the way: we are hiring!). Before that, I held research and innovation-related roles in the USA and in Switzerland.
Let me say a few words about Thomson Reuters before I go more into my own activities, just for background. Thomson Reuters has around 50,000 employees in over 100 countries and sells information to professionals in many verticals, including finance & risk, legal, intellectual property & scientific, tax & accounting.
Our headquarters are located at 3 Times Square in New York City, NY, USA.
Most people know our REUTERS brand from reading their newspapers (thanks to our highly regarded 3,000+ journalists at news desks in about 100 countries, often putting their lives at risk to give us reliable reports of the world’s events) or receiving share price information on the radio or TV, but as a company, we are also involved in areas as diverse as weather prediction (as the weather influences commodity prices) and determining the citation impact of academic journals (which helps publishers sell their academic journals to librarians), or predicting Nobel prize winners.
My research colleagues and I study information access and especially means to improve it, using natural language processing, information extraction, machine learning, search engine ranking, recommendation systems and similar areas of investigation.
We carry out a lot of contract research for internal business units (especially if external vendors do not offer what we need, or if we believe we can build something internally that is lower cost and/or better suited to our needs), feasibility studies to de-risk potential future products that are considered, and also more strategic, blue-sky research that anticipates future needs. As you would expect, we protect our findings and publish them in the usual scientific venues.

Q2. Do you use Data Analytics at Thomson Reuters and for what?
Jochen L. Leidner: Note that terms like “analytics” are rather too broad to be useful in many instances; but the basic answer is “yes”, we develop, apply internally, and sell as products to our customers what can reasonably be called solutions that incorporate “data analytics” functions.
One example capability that we developed is the recommendation engine CaRE (Al-Kofahi et al., 2007), built by our group, Corporate Research & Development, which is led by our VP of Research & Development, Khalid al-Kofahi. This is a bit similar in spirit to Amazon’s well-known book recommendations. We used it to recommend legal documents to attorneys, as a service to supplement the user’s active search on our legal search engine with “see also…”-type information. This is an example of a capability developed in-house that also made it into a product, and it is very popular.
Thomson Reuters is selling information services, often under a subscription model, and for that it is important to have metrics available that indicate usage, in order to inform our strategy. So another example of data analytics is that we study how document usage can inform personalization and ranking, and from where documents are accessed, and we use this to plan network bandwidth and to determine caching server locations.
A completely different example is citation information: Since 1969, when our esteemed colleague Eugene Garfield (he is now officially retired, but is still active) came up with the idea of citation impact, our Scientific business division is selling the journal citation impact factor – an analytic that can be used as a proxy for the importance of a journal (and, by implication, as an argument to a librarian to purchase a subscription of that journal for his or her university library).
Or, to give another example from the financial markets area, we are selling predictive models (StarMine) that estimate how likely it is that a given company will go bankrupt within the next six months.

Q3. Do you have Big Data at Thomson Reuters? Could you please give us some examples of Big Data Use Cases at your company?
Jochen L. Leidner: For most definitions of “big”, yes we do. Consider that we operate a news organization, which generates tens of thousands of news reports daily (if we count all languages together). Then we have photo journalists who create large numbers of high-quality, professional photographs to document current events visually, and videos comprising audio-visual storytelling and interviews. We further collect all major laws, statutes, regulations and legal cases in major jurisdictions around the world, and enrich the data with our own meta-data using both manual expertise and automatic classification and tagging tools to enhance findability. We hold collections of scientific articles and patents in full text and abstracts.
We gather, consolidate and distribute price information for financial instruments from hundreds of exchanges around the world. We sell real-time live feeds as well as access to decades of these time series for the purpose of back-testing trading strategies.

Q4. What “value” can be derived by analyzing Big Data at Thomson Reuters?
Jochen L. Leidner: This is the killer question: we take the “value” very much without the double quotes – big data analytics lead to cost savings as well as generate new revenues in a very real, monetary sense of the word “value”. Because our solutions provide what we call “knowledge to act” to our customers, i.e., information that lets them make better decisions, we provide them with value as well: we literally help our customers save the cost of making a wrong decision.

Q5. What are the main challenges for big data analytics at Thomson Reuters ?
Jochen L. Leidner: I’d say absolute volume, growth, data management/integration, rights management, and privacy are some of the main challenges.
One obvious challenge is the size of the data. It’s not enough to have enough persistent storage space to keep it, we also need backup space, space to process it, caches and so on – it all adds up. Another is the growth and speed of that growth of the data volume. You can plan for any size, but it’s not easy to adjust your plans if unexpected growth rates come along.
Another challenge that we must look into is the integration between our internal data holdings, external public data (like the World Wide Web, or commercial third-party sources – like Twitter – which play an important role in the modern news ecosystem), and customer data (customers would like to see their own internal, proprietary data be inter-operable with our data). We need to respect the rights associated with each data set, as we deal with our own data, third party data and our customers’ data. We must be very careful regarding privacy when brainstorming about the next “big data analytics” idea – we take privacy very seriously and our CPO joined us from a government body that is in charge of regulating privacy.

Q6. How do you handle the Big Data Analytics “process” challenges with deriving insight?
Jochen L. Leidner: Usually analytics projects happen as an afterthought to leverage existing data created in a legacy process, which means not a lot of change of process is needed at the beginning. This situation changes once there is resulting analytics output, and then the analytics-generating process needs to be integrated with the previous processes.
Even with new projects, product managers still don’t think of analytics as the first thing to build into a product for a first bare-bones version, and we need to change that; instrumentation is key for data gathering so that analytics functionality can build on it later on.
In general, analytics projects follow a process of (1) capturing data, (2) aligning data from different sources (e.g., resolving when two objects are the same), (3) pre-processing or transforming the data into a form suitable for analysis, (4) building some model and (5) understanding the output (e.g. visualizing and sharing the results). This five-step process is followed by an integration phase into the production process to make the analytics repeatable.
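
As a rough illustration of those five steps, here is a toy Python/pandas sketch; the data sources, column names and the trivial “model” are hypothetical and are not Thomson Reuters code.

```python
# A minimal sketch of the five-step analytics process described above.
# Data, thresholds and column names are hypothetical.
import pandas as pd

# (1) Capture: read raw data from two hypothetical sources.
usage = pd.DataFrame({"doc": ["d1", "d2", "d2"], "region": ["EU", "US", "US"]})
catalog = pd.DataFrame({"doc": ["d1", "d2"], "topic": ["tax", "ip-law"]})

# (2) Align: resolve records that refer to the same object (join on the doc id).
aligned = usage.merge(catalog, on="doc")

# (3) Pre-process / transform into a form suitable for analysis.
counts = aligned.groupby(["region", "topic"]).size().rename("accesses").reset_index()

# (4) Build a (trivial) model: flag region/topic pairs with above-average usage.
counts["above_avg"] = counts["accesses"] > counts["accesses"].mean()

# (5) Understand the output: share or visualize the result.
print(counts)
```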

Q7. What kind of data management technologies do you use? What is your experience in using them?
Jochen L. Leidner: For storage and querying, we use relational database management systems to store valuable data assets and also use NoSQL databases such as CouchDB and MongoDB in projects where applicable. We use homegrown indexing and retrieval engines as well as open source libraries and components like Apache Lucene, Solr and ElasticSearch.
We use parallel, distributed computing platforms such as Hadoop/Pig and Spark to process data, and virtualization to manage isolation of environments.
We sometimes push out computations to Amazon’s EC2 and storage to S3 (but this can only be done if the data is not sensitive). And of course in any large organization there is a lot of homegrown software around. My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.

Q8. Do you handle unstructured data? If yes, how?
Jochen L. Leidner: We curate lots of our own unstructured data from scratch, including the thousands of news stories that our over three thousand REUTERS journalists write every day, and we also enrich the unstructured content produced by others: for instance, in the U.S. we manually enrich legal cases with human-written summaries (written by highly-qualified attorneys) and classify them based on a proprietary taxonomy, which informs our market-leading WestlawNext product (see this CIKM 2011 talk), our search engine for the legal profession, who need to find exactly the right cases. Over the years, we have developed proprietary content repositories to manage content storage, meta-data storage, indexing and retrieval. One of our challenges is to unite our data holdings, which often come from various acquisitions that use their own technology.

Q9. Do you use Hadoop? If yes, what is your experience with Hadoop so far?
Jochen L. Leidner: We have two clusters featuring Hadoop, Spark and GraphLab, and we are using them intensively. Hadoop is mature as an implementation of the MapReduce computing paradigm, but has its shortcomings because it is not a true operating system (but probably it should be) – for instance regarding the stability of HDFS, its distributed file system. People have started to realize there are shortcomings, and have started to build other systems around Hadoop to fill gaps, but these are still early stage, so I expect them to become more mature first and then there might be a wave of consolidation and integration. We have definitely come a long way since the early days.
Typically, on our clusters we run batch information extraction tasks, data transformation tasks and large-scale machine learning processes to train classifiers and taggers for these. We are also inducing language models and training recommender systems on them. Since many training algorithms are iterative, Spark can win over Hadoop for these, as it keeps models in RAM.
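
The point about iterative algorithms and RAM can be illustrated with a small PySpark sketch; it assumes a local Apache Spark installation, and the data and one-parameter model are toy examples rather than anything from the clusters described above.

```python
# Sketch (assumes a local Apache Spark installation with PySpark): caching the
# training data in memory pays off when an iterative algorithm re-reads it on
# every pass. The (x, y) points and the single-weight model are toy examples.
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")

points = sc.parallelize([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)])
points.cache()  # keep the RDD in RAM across iterations instead of re-reading it

w = 0.0
lr = 0.01
for _ in range(20):  # each iteration re-uses the cached data
    grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]).reduce(lambda a, b: a + b)
    w -= lr * grad

print("fitted w ~", w)  # converges toward roughly 2.0 for this toy data
sc.stop()
```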

Q10. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?
Jochen L. Leidner: The dichotomy is perhaps between batch processing and dialog processing, whereas real-time (“hard”, deterministic time guarantee for a system response) goes hand-in-hand with its opposite non-real time, but I think what you are after here is that dialog systems have to be responsive. There is no one-size-fits-all method for meeting (near) real-time requirements or for making dialog systems more responsive; a lot of analytics functions require the analysis of more than just a few recent data-points, so if that’s what is needed it may take its time. But it is important that critical functions, such as financial data feeds, are delivered as fast as possible – micro-seconds matter here. The more commoditized analytics functions become, the faster they need to be available to retain at least speed as a differentiator.

Q11 Cloud computing and open source: Do you they play a role at Thomson Reuters? If yes, how?
Jochen L. Leidner: Regarding cloud computing, we use cloud services internally and as part of some of our product offerings. However, there are also reservations – a lot of our applications contain information that is too sensitive to entrust to a third party, especially as many cloud vendors cannot give you a commitment with respect to hosting (or not hosting) in particular jurisdictions. Therefore, we operate our own set of data centers, and some part of these operates as what has become known as “private clouds”, retaining the benefit of the management outsourcing abstraction, but within our large organization rather than pushing it out to a third party. Of course the notion of private clouds leads the cloud idea ad absurdum quite a bit, because it sacrifices the economy of scale, but having more control is an advantage.
Open source plays a huge role at Thomson Reuters – we rely on many open source components, libraries and systems, especially under the MIT, BSD, LGPL and Apache licenses. For example, some of our tagging pipelines rely on Apache UIMA, which is a contribution originally developed at IBM, and which has seen contributions by researchers from all around the world (from Darmstadt to Boulder). To date, we have not been very good about opening up our own services in the form of source code, but we are trying to change that now, and we have just launched a new corporation-wide process for open-sourcing software. We also have an internal sharing repository, “Corporate Source”, but in my personal view the audience in any single company is too small – open source (like other recent waves such as clouds or crowdsourcing) needs Internet scale to work, and Web search engines for the various projects to be discovered.

Q12 What are the main research challenges ahead? And what are the main business challenges ahead?
Jochen L. Leidner: Some of the main business challenges are the cost pressure that some of our customers face, and the increasing availability of low-cost or free-of-charge information sources, i.e. the commoditization of information. I would caution here that whereas the amount of information available for free is large, this in itself does not help you if you have a particular problem and cannot find the information that helps you solve it, either because the solution is not there despite the size, or because it is there but findability is low. Further challenges include information integration, making systems ever more adaptive, but only to the extent it is useful, or supporting better personalization. Having said this, sometimes systems need to be run in a non-personalized mode (e.g. in the field of e-discovery, you need a certain consistency, namely that the same legal search system retrieves the same things today and tomorrow, and for different parties).

Q13 Anything else you wish to add?
Jochen L. Leidner: I would encourage decision makers of global companies not to get misled by fast-changing “hype” language: words like “cloud”, “analytics” and “big data” are too general to inform a professional discussion. Putting my linguist’s hat on, I can only caution about the lack of precision inherent in marketing language’s use in technological discussions: for instance, cluster computing is not the same as grid computing. And what looks like “big data” today we will almost certainly carry around with us on a mobile computing device tomorrow. Also, buzz words like “Big Data” do not by themselves solve any problem – they are not magic bullets. To solve any problem, look at the input data, specify the desired output data, and think hard about whether and how you can compute the desired result – nothing but “good old” computer science.

Dr. Jochen L. Leidner is a Lead Scientist with Thomson Reuters, where he is building up the corporation’s London site of its Research & Development group.
He holds Master’s degrees in computational linguistics, English language and literature and computer science from Friedrich-Alexander University Erlangen-Nuremberg and in computer speech, text and internet technologies from the University of Cambridge, as well as a Ph.D. in information extraction from the University of Edinburgh (“Toponym Resolution in Text”). He is the recipient of the first ACM SIGIR Doctoral Consortium Award, a Royal Society of Edinburgh Enterprise Fellowship in Electronic Markets, and two DAAD scholarships.
Prior to his research career, he has worked as a software developer, including for SAP AG (basic technology and knowledge management) as well as for startups.
He led the development teams of multiple question answering systems, including the QED system at Edinburgh and Alyssa at Saarland University, the latter of which ranked third at the last DARPA/NIST TREC open-domain question answering factoid track evaluation.
His main research interests include information extraction, question answering and search, geo-spatial grounding, and applied machine learning, with a focus on the methodology behind research & development in the area of information access.
In 2013, Dr. Leidner has also been teaching an invited lecture course “Language Technology and Big Data” at the University of Zurich, Switzerland.

—————————-
Related Posts

-Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. September 23, 2013

-On Linked Data. Interview with John Goodwin. September 1, 2013

-Big Data Analytics at Netflix. Interview with Christos Kalantzis and Jason Brown. February 18, 2013

Resources

- ODBMS.org free resources on Big Data and Analytical Data Platforms:
Blog Posts | Free Software| Articles | Lecture Notes | PhD and Master Thesis|

Follow us on Twitter: @odbmsorg

Nov 5 13

On Big Data. Interview with Adam Kocoloski.

by Roberto V. Zicari

“The pace that we can generate data will outstrip our ability to store it. I think you’ll soon see data scientists emphasizing the ability to make decisions on data before storing it.” –Adam Kocoloski.

I have interviewed Adam Kocoloski, Founder & CTO of Cloudant.

RVZ

Q1. What can we learn from physics when managing and analyzing big data for the enterprise?

Adam Kocoloski: The growing body of data collected in today’s Web applications and sensor networks is a potential goldmine for businesses. But modeling transactions between people and causality between events becomes challenging at large scale, and traditional enterprise systems like data warehousing and business intelligence are too cumbersome to extract value fast enough.

Physicists are natural problem solvers, equipped to think through what tools will work for particular data challenges. In the era of big data, these challenges are growing increasingly relevant, especially to the enterprise.

In a way, physicists have it easier. Analyzing isolated particle collisions translated well to distributed university research systems and parallel models of computing. In other ways, we have shared the challenge of filtering big data to find useful information. In my physics work, we addressed this problem with blind analysis and machine learning. I think you’ll soon see those practices emerge in the field of enterprise data analysis.

Q2. How do you see data science evolving in the near future?

Adam Kocoloski: The pace that we can generate data will outstrip our ability to store it. I think you’ll soon see data scientists emphasizing the ability to make decisions on data before storing it.

The sheer volume of data we’re storing is a factor, but what’s more interesting is the shift toward the distributed generation of data — data from mobile devices, sensor networks, and the coming “Internet of Things.” It’s easy for an enterprise to stand up Hadoop in its own data center and start dumping data into it, especially if it plans to sort out the valuable parts later. It’s not so easy when it’s large volumes of operational data generated in a distributed system. Machine learning algorithms that can recognize and store only the useful patterns can help us better deal with the deluge.

As physicists, we learned that the way big data is headed, there’s no way we’ll be able to keep writing it all down. That’s the tradeoff today’s data scientists must learn: right when you collect the data, you need to make decisions on throwing it away.
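
A minimal sketch of that tradeoff, deciding at ingest time which readings are worth persisting, might look like the following; the window size, threshold and data are hypothetical, and a real system would likely use a trained model rather than this simple rule.

```python
# Sketch of "deciding before storing": keep only sensor readings that deviate
# noticeably from a running average, discard the rest at ingest time.
from collections import deque

def filter_stream(readings, window=5, threshold=2.0):
    recent = deque(maxlen=window)   # small rolling window kept in memory
    kept = []
    for value in readings:
        if len(recent) == window:
            baseline = sum(recent) / window
            if abs(value - baseline) > threshold:
                kept.append(value)  # "useful pattern": persist it
            # otherwise: drop the reading instead of persisting it
        recent.append(value)
    return kept

stream = [10.1, 10.0, 9.9, 10.2, 10.0, 17.5, 10.1, 9.8, 3.2, 10.0]
print(filter_stream(stream))  # -> [17.5, 3.2]
```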

Q3. In your opinion, given the current available Big Data technologies, what is the most difficult challenge in filtering big data to find useful information?

Adam Kocoloski: Cloudant is an operational data store and not a big data or offline analytics platform like Hadoop. That means we deal with mutable data that applications are accessing and changing as they run.

From my physics experience, the most difficult big data challenge I’ve seen is the lack of accurate simulations for machine learning. For me, that meant simulations of the STAR particle detector at Brookhaven National Lab’s Relativistic Heavy Ion Collider (RHIC).

People use machine learning algorithms in many fields, and they don’t always understand the caveats of building an appropriate training data set. It’s easy to apply training data without fully understanding how the process works. If they do that, they won’t realize when they’ve trained their machine learning algorithms inappropriately.

Slicing data from big data sets is great, but at a certain point it becomes a black box that makes it hard to understand what is and what isn’t working well in your analysis. The bigger the data, the more it’s possible for one variable to be related to others in nonlinear ways. This problem makes it harder to reason about data, placing more demands on data scientists to build training data sets using a balanced combination of linear and nonlinear techniques.

Q4. Could you please explain why blind analyses is important for Big Data?

Adam Kocoloski: Humans are naturally predisposed to find signals. It’s an evolutionary trait of ours. It’s better if we recognize the tiger in the jungle, even if there really isn’t one there. If we see a bump in a distribution of data, we do what we can to tease it out. We bias ourselves that way.
So when you do a blind analysis, you hopefully immunize yourself against that bias.

Data scientists are people too, and with big data, they can’t become overly reliant on data visualization. It’s too easy for us to see things that aren’t really there. Instead of seeking out the signals within all that data, we need to work on recognizing the noise — the data we don’t want — so we can inversely select the data we want to keep.
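
One common blinding technique from experimental physics can be sketched in a few lines: add a hidden offset to the quantity of interest, tune the analysis on the blinded values, and remove the offset only once the procedure is frozen. The numbers below are hypothetical.

```python
# Sketch of one blinding technique: a hidden, random offset is added to the
# quantity being measured; the analyst tunes cuts and fits on the blinded
# values and unblinds only after the procedure is frozen.
import random

measurements = [4.9, 5.1, 5.0, 5.2, 4.8]           # hypothetical raw values

blinding_offset = random.uniform(-10, 10)           # kept secret from the analyst
blinded = [m + blinding_offset for m in measurements]

# ... cuts, fits and quality checks are developed on `blinded` only ...
blinded_mean = sum(blinded) / len(blinded)

# Unblinding: performed once, after the analysis is frozen.
final_result = blinded_mean - blinding_offset
print(round(final_result, 3))                       # ~5.0
```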

Q5. Is machine learning the right way to analyze Big Data?

Adam Kocoloski: Machine learning offers the possibility to improve the signal-to-noise ratio beyond what any manually constructed analysis can do.
The potential is there, but you have to balance it with the need to understand the training data set. It’s not a panacea. Algorithms have weak points. They have places where they fail. When you’re applying various machine-learning analyses, it’s important that you understand where those weak points are.

Q6. The past year has seen a renaissance in NewSQL. Will transactions ultimately spell the end of NoSQL databases?

Adam Kocoloski: No — 1) because there’s a wide, growing class of problems that don’t require transactional semantics and 2) mobile computing makes transactions at large scale technically infeasible.

Applications like address books, blogs, or content management systems can store a wide variety of data and, largely, do not require a high degree of transactional integrity. Using systems that inherently enforce schemas and row-level locking — like a relational database management system (RDBMS) — unnecessarily over-complicates these applications.

It’s widely thought that the popularity of NoSQL databases was due to the inability of relational databases to scale horizontally. If NewSQL databases can provide transactional integrity for large, distributed databases and cloud services, does this undercut the momentum of the NoSQL movement? I argue that no, it doesn’t, because mobile computing introduces new challenges (e.g. offline application data and database sync) that fundamentally cannot be addressed in transactional systems.

It’s unrealistic to lock a row in an RDBMS when a mobile device that’s only occasionally connected could introduce painful amounts of latency over unreliable networks. Add that to the fact that many NoSQL systems are introducing new behaviors (strong consistency, multi-document transactions) and strategies for approximating ACID transactions (event sourcing) — mobile is showing us that we need to rethink the information theory behind it.
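
Since event sourcing is only mentioned in passing, here is a minimal sketch of the idea: instead of updating state in place, append immutable events and rebuild the current state by replaying them. The account example and event names are hypothetical.

```python
# Minimal sketch of event sourcing: the append-only event log is the source of
# truth, and current state is derived by replaying it.
events = []

def record(event_type, amount):
    events.append({"type": event_type, "amount": amount})

def balance():
    total = 0
    for e in events:                       # replay the log to rebuild state
        if e["type"] == "deposit":
            total += e["amount"]
        elif e["type"] == "withdrawal":
            total -= e["amount"]
    return total

record("deposit", 100)
record("withdrawal", 30)
record("deposit", 5)
print(balance())  # 75
```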

Q7. What is the technical role that CouchDB clustering plays for Cloudant’s distributed data hosting platform?

Adam Kocoloski: At Cloudant, clustering allows us to take one logical database and partition that database for large scale and high availability.
We also store redundant copies of the partitions that make up that cluster, and to our customers, it all looks and operates like one logical database. CouchDB’s interface naturally lends itself to this underlying clustering implementation, and it is one of the many technologies we have used to build Cloudant’s managed database service.
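
As a rough sketch of that kind of partitioning, not Cloudant’s actual implementation, the following hashes a document id to a shard and places redundant copies of the shard on distinct nodes; the shard and replica counts are arbitrary.

```python
# Sketch (not Cloudant's implementation): hash-partitioning one logical
# database into shards and placing redundant copies on different nodes,
# in the spirit of the Dynamo-style clustering described above.
import hashlib

NODES = ["node1", "node2", "node3", "node4"]
NUM_SHARDS = 8      # partitions of the logical database
N_COPIES = 3        # redundant copies per shard

def shard_for(doc_id: str) -> int:
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_for(shard: int) -> list:
    # place each copy on a distinct node, walking the node ring
    return [NODES[(shard + i) % len(NODES)] for i in range(N_COPIES)]

doc_id = "user:42"
shard = shard_for(doc_id)
print(doc_id, "-> shard", shard, "on nodes", replicas_for(shard))
```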

Cloudant is built to be more than just hosted CouchDB. Along with CouchDB, open source software projects like HAProxy, Lucene, Chef, and Graphite play a crucial role in running our service and managing the experience for customers. Cloudant is also working with organizations like the Open Geospatial Consortium (OGC) to develop new standards for working with geospatial data sets.

That said, the semantics of CouchDB replication — if not the actual implementation itself — are critical to Cloudant’s ability to synchronize individual JSON documents or entire database partitions between shard copies within a single cluster, between clusters in the same data center, and between data centers across the globe. We’ve been able to horizontally scale CouchDB and apply its unique replication abilities on a much larger scale.
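
For readers who want to try the replication semantics themselves, CouchDB exposes replication through its HTTP API; the sketch below posts to the standard /_replicate endpoint using Python’s requests library. The URLs, database names and credentials are hypothetical, and options beyond source, target and continuous vary by version and are omitted.

```python
# Sketch of triggering CouchDB-style replication over the standard HTTP API.
# URLs, database names and credentials below are hypothetical.
import requests

resp = requests.post(
    "http://localhost:5984/_replicate",
    json={
        "source": "http://localhost:5984/orders",
        "target": "https://user:pass@example.cloudant.com/orders",
        "continuous": True,   # keep the target in sync as documents change
    },
)
print(resp.status_code, resp.json())
```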

Q8. Cloudant recently announced the merging of its distributed database platform into the Apache CouchDB project. Why? What are the benefits of such integration?

Adam Kocoloski: We merged the horizontal scaling and fault-tolerance framework we built in BigCouch into Apache CouchDB™. The same way Cloudant has applied CouchDB replication in new ways to adapt the database for large distributed systems, Apache CouchDB will now share those capabilities.

Previously, the biggest knock on CouchDB was that it couldn’t scale horizontally to distribute portions of a database across multiple machines. People saw it as a monolithic piece of software, only fit to run on a single server. That is no longer the case.

Obviously new scalability features are good for the Apache project, and a healthy Apache CouchDB is good for Cloudant. The open source community is an excellent resource for engineering talent and sales leads. Our contribution will also improve the quality of our code. Having more of it out there in live deployment will only increase the velocity of our development teams. Many of our engineers wear multiple hats — as Cloudant employees and Apache CouchDB project committers. With the code merger complete, they’ll no longer have to maintain multiple forks of the codebase.

Q9. Will there be two offerings of the same Apache CouchDB: one from Couchbase and one from Cloudant?

Adam Kocoloski: No. Couchbase has distanced itself from the Apache project. Their product, Couchbase Server, is no longer interface-compatible with Apache CouchDB and has no plans to become so.

———
Adam Kocoloski, Founder & CTO of Cloudant.
Adam is an Apache CouchDB developer and one of the founders of Cloudant. He is the lead architect of BigCouch, a Dynamo-flavored clustering solution for CouchDB that serves as the core of Cloudant’s distributed data hosting platform. Adam received his Ph.D. in Physics from MIT in 2010, where he studied the gluon’s contribution to the spin structure of the proton using a motley mix of server farms running Platform LSF, SGE, and Condor. He and his wife Hillary are the proud parents of two beautiful girls.

Related Posts

- Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. September 23, 2013

- On NoSQL. Interview with Rick Cattell. August 19, 2013

- On Big Data Analytics –Interview with David Smith. February 27, 2013

Resources

- “NoSQL Failover Characteristics: Aerospike, Cassandra, Couchbase, MongoDB” (.pdf), by Denis Nelubin, Ben Engber, Thumbtack Technology, 2013

- “Ultra-High Performance NoSQL Benchmarking - Analyzing Durability and Performance Tradeoffs” (.pdf) by Denis Nelubin, Ben Engber, Thumbtack Technology, 2013

Follow us on Twitter: @odbmsorg

Oct 28 13

On multi-model databases. Interview with Martin Schönert and Frank Celler.

by Roberto V. Zicari

“We want to prevent a deadlock where the team is forced to switch the technology in the middle of the project because it doesn’t meet the requirements any longer.”–Martin Schönert and Frank Celler.

On “multi-model” databases, I have interviewed Martin Schönert and Frank Celler, founders and creators of the open source ArangoDB.

RVZ

Q1. What is ArangoDB and for what kind of applications is it designed for?

Frank Celler: ArangoDB is a multi-model mostly-memory database with a flexible data model for documents and graphs. It is designed as a “general purpose database”, offering all the features you typically need for modern web applications.

ArangoDB is supposed to grow with the application—the project may start as a simple single-server prototype, nothing you couldn’t do with a relational database equally well. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end—this is where Foxx, ArangoDB’s integrated Javascript application framework, comes into play.
The overall idea is: “We want to prevent a deadlock where the team is forced to switch the technology in the middle of the project because it doesn’t meet the requirements any longer.”

ArangoDB is open source (Apache 2 licence)—you can get the source code at GitHub or download the precompiled binaries from our website.

Though ArangoDB takes a universal approach, there are edge cases where we don’t recommend it. For instance, ArangoDB doesn’t compete with massively distributed systems like Cassandra with thousands of nodes and many terabytes of data.

Q2. What’s so special about the ArangoDB data model?

Martin Schönert: ArangoDB is a multi-model database. It stores documents in collections. A specialized binary data file format is used for disk storage. Documents that have similar structure (i.e., that have the same attribute names and attribute types) can share their structural information. The structure (called “shape”) is saved just once, and multiple documents can re-use it by storing just a pointer to their “shape”.
In practice, documents in a collection are likely to be homogenous, and sharing the structure data between multiple documents can greatly reduce disk storage space and memory usage for documents.
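
To make the idea of shared shapes more concrete, here is a small conceptual Python sketch. It is only an illustration of structure sharing, under the stated assumption that structurally identical documents point at one shared shape record; it is not ArangoDB’s actual storage code:

shapes = {}          # shape signature -> shape id
shape_defs = {}      # shape id -> ordered attribute names
documents = []       # each document stores (shape_id, tuple_of_values)

def store(doc):
    signature = tuple(sorted(doc.keys()))
    if signature not in shapes:
        shapes[signature] = len(shapes)            # register a new shape only once
        shape_defs[shapes[signature]] = signature
    shape_id = shapes[signature]
    documents.append((shape_id, tuple(doc[k] for k in shape_defs[shape_id])))

def load(i):
    shape_id, values = documents[i]
    return dict(zip(shape_defs[shape_id], values))

store({"name": "alice", "city": "Cologne"})
store({"name": "bob", "city": "Aachen"})     # structurally identical: re-uses the shape
print(len(shapes), load(1))                  # 1 {'city': 'Aachen', 'name': 'bob'}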

Q3. Who is currently using ArangoDB for what?

Frank Celler: ArangoDB is open source. You don’t have to register to download the source code or precompiled binaries. As a user, you can get support via the Google Group, GitHub’s issue tracker and even via Twitter. We are very approachable, which is an essential part of the project. The drawback is that we don’t really know in detail what people are doing with ArangoDB. We have noticed an exponentially increasing number of downloads over the last few months.
We are aware of a broad range of use cases: a CMS, a high-performance logging component, a geo-coding tool, an annotation system for animations, just to name a few. Other interesting use cases are single page apps or mobile apps via Foxx, ArangoDB’s application framework. Many of our users have in-production experience with other NoSQL databases, especially the leading document stores.

Q4. Could you motivate your design decision to use Google’s V8 JavaScript engine?

Martin Schönert: ArangoDB uses Google’s V8 engine to execute server-side JavaScript functions. Users can write server-side business logic in JavaScript and deploy it in ArangoDB. These so-called “actions” are much like stored procedures living close to the data.
For example, with actions it is possible to perform cascading deletes/updates, assign permissions, and do additional calculations and modifications to the data.
ArangoDB also allows users to map URLs to custom actions, making it usable as an application server that handles client HTTP requests with user-defined business logic.
We opted for Javascript as it meets our requirements for an “embedded language” in the database context:
• Javascript is widely used. Regardless of which “back-end language” web developers write their code in, almost everybody can also code in Javascript.
• Javascript is effective and still modern.
We also chose Google’s V8 because it is currently the fastest, most stable Javascript engine available.

Q5. How do you query ArangoDB if you don’t want to use JavaScript?

Frank Celler: ArangoDB offers a couple of options for getting data out of the database. It has a REST interface for CRUD operations and also allows “querying by example”. “Querying by example” means that you create a JSON document with the attributes you are looking for. The database returns all documents which look like the “example document”.
Expressing complex queries as JSON documents can become a tedious task—and it’s almost impossible to support joins following this approach. We wanted a convenient and easy-to-learn way to execute even complex queries, not involving any programming as in an approach based on map/reduce. As ArangoDB supports multiple data models including graphs, it was neither sufficient to stick to SQL nor to simply implement UNQL. We ended up with the “ArangoDB query language” (AQL), a declarative language similar to SQL and JSONiq. AQL supports joins, graph queries, list iteration, results filtering, results projection, sorting, variables, grouping, aggregate functions, unions, and intersections.
Of course, ArangoDB also offers drivers for all major programming languages. The drivers wrap the mentioned query options following the paradigm of the programming language and/or frameworks like Ruby on Rails.
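
As a rough illustration of “querying by example” over the REST interface, the following Python sketch sends an example document to a local ArangoDB server. The endpoint name, port and the “users” collection are assumptions here (the simple-query API has changed across ArangoDB versions), so treat this as a sketch rather than a reference:

import json
import urllib.request

payload = json.dumps({
    "collection": "users",                           # assumed collection name
    "example": {"city": "Cologne", "active": True},  # the "example document"
}).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:8529/_api/simple/by-example",  # assumed endpoint and port
    data=payload,
    headers={"Content-Type": "application/json"},
    method="PUT",
)

with urllib.request.urlopen(request) as response:
    body = json.load(response)

# Print every document that "looks like" the example document.
for doc in body.get("result", []):
    print(doc)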

Q6. How do you perform graph queries? How does this differ from systems such as Neo4J?

Frank Celler: SQL can’t cope with the required semantics to express the relationships between graph nodes, so graph databases have to provide other ways to access the data.
The first option is to write small programs, so-called “path traversals.” In ArangoDB you use Javascript; in Neo4j, Java. The general approach is very similar.
Programming gives you all the freedom to do whatever comes to your mind. That’s good. For standard use cases, programming might be too much effort. So, both ArangoDB and Neo4j offer a declarative language—Neo4j has “Cypher,” ArangoDB the “ArangoDB Query Language.” Both also implement the Blueprints standard, so you can use “Gremlin” as a query language inside Java. We already mentioned that ArangoDB is a multi-model database: AQL covers documents and graphs; it provides support for joins, lists, variables, and much more.

The following example is taken from the neo4j website:

“For example, here is a query which finds a user called John in an index and then traverses the graph looking for friends of John’s friends (though not his direct friends) before returning both John and any friends-of-friends that are found.

START john=node:node_auto_index(name = ‘John’)
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof ”

The same query looks in AQL like this:

FOR t IN TRAVERSAL(users, friends, "users/john", "outbound",
{minDepth: 2}) RETURN t.vertex._key

The result is:
[ "maria", "steve" ]

You see that Cypher describes patterns while AQL describes joins. Internally, ArangoDB has a library of graph functions—those functions return collections of paths, which can then be used in a join.

Q7. How did you design ArangoDB to scale out and/or scale up? Please give us some detail.

Martin Schönert: Solid state disks are becoming more and more of a commodity. ArangoDB’s append-only design is a perfect fit for SSDs, allowing for data-sets which are much bigger than main memory but still fit onto a solid state disk.
ArangoDB supports master/slave replication in version 1.4, which will be released in the next few days (a beta has been available for some time). On the one hand this provides easy fail-over setups; on the other hand it provides a simple way to scale read performance.
Sharding is implemented in version 2.0. This enables you to store even bigger data-sets and increase write performance. As noted before, however, we see our main use case in scaling to a low number of nodes. We don’t plan to optimize ArangoDB for massive scaling with hundreds of nodes; plain key/value stores are much better suited to such scenarios.

Q8. What is ArangoDB’s update and delete strategy?

Martin Schönert: ArangoDB versions prior to 1.3 store all revisions of documents in an append-only fashion; the objects are never overwritten. Only the latest version of a document is visible to the end user.

With the current version 1.3, ArangoDB introduces transactions and lays the technical foundation for replication and sharding. Along with those highly requested features comes “real” MVCC with concurrent writes.

In databases implementing an append-only strategy, obsolete versions of a document have to be removed to save space. As we already mentioned, ArangoDB is multi-threaded: The so-called compaction is automatically done in the background in a different thread without blocking reads and writes.
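
The following Python fragment is a conceptual sketch only, not ArangoDB’s storage engine: it illustrates the append-only idea in which updates append new revisions, reads return the latest revision, and a later compaction pass drops the obsolete ones:

import itertools

log = []                        # append-only list of (key, revision, value)
revisions = itertools.count(1)

def write(key, value):
    log.append((key, next(revisions), value))   # never overwrite, always append

def read(key):
    latest = None
    for k, _, v in log:
        if k == key:
            latest = v                          # later revisions win
    return latest

def compact():
    # Keep only the newest revision per key and drop obsolete ones.
    newest = {}
    for entry in log:
        newest[entry[0]] = entry
    log[:] = list(newest.values())

write("doc/1", {"x": 1})
write("doc/1", {"x": 2})         # the old revision stays in the log for now
compact()                        # ...until compaction removes it in the background
print(read("doc/1"), len(log))   # {'x': 2} 1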

Q9. How does ArangoDB differ from other NoSQL data stores such as Couchbase and MongoDB and graph data stores such as Neo4j, to name a few?

Frank Celler: ArangoDB’s feature scope is driven by the idea of giving developers everything they need to master typical tasks in a web application—in a way that is both convenient and technically sophisticated.
From our point of view it is the combination of features and the quality of the product that sets ArangoDB apart: ArangoDB not only handles documents but also graphs.
ArangoDB is extensible via Javascript and Ruby. Bundled with ArangoDB you get “Foxx”, an integrated application framework ideal for lean back-ends and single-page Javascript applications (SPAs).
Multi-collection transactions are useful not only for online banking and e-commerce; they become crucial in any web app with a distributed architecture. Here again, we offer developers many choices. If transactions are needed, developers can use them.
If, on the other hand, the problem requires higher performance and less transaction safety, developers are free to ignore multi-collection transactions and use the standard single-document transactions implemented by most NoSQL databases.
Another unique feature is ArangoDB’s query language AQL—it makes querying powerful and convenient. For simple queries, we offer a simple query-by-example interface. Then again, AQL enables you to describe complex filter conditions and joins in a readable format.

Q10. Could you summarize the main results of your benchmarking tests?

Frank Celler: To quote Jan Lenhardt from CouchDB: “Nosql is not about performance, scaling, dropping ACID or hating SQL—it is about choice. As nosql databases are somewhat different it does not help very much to compare the databases by their throughput and chose the one which is fastest. Instead—the user should carefully think about his overall requirements and weight the different aspects. Massively scalable key/value stores or memory-only system[s] can achieve much higher benchmarks. But your aim is [to] provide a much more convenient system for a broader range of use-cases—which is fast enough for almost all cases.”
Anyway, we have done a lot of performance tests and are more than happy with the results. ArangoDB 1.3 inserts up to 140,000 documents per second. We are going to publish the whole test suite, including a test runner, soon, so everybody can try it out on their own hardware.

We have also tested space usage: storing 3.5 million AQL search queries takes about 200 MB in MongoDB with pre-allocation, compared to 55 MB in ArangoDB. This is the benefit of implementing the concept of shapes.

Q11. ArangoDB is open source. How do you motivate and involve the open source development community to contribute to your projects rather than any other open source NoSQL?

Frank Celler: To be honest, the contributors come of their own volition; so far we haven’t had to “push” interested parties. Obviously, ArangoDB is fascinating enough, even though there are more than 150 NoSQL databases to choose from.

It all started when Ruby inventor Yukihiro “Matz” Matsumoto tweeted about ArangoDB and recommended it to the community. Following this tweet, ArangoDB’s first local fan base was established in Japan—and we learned a lot about the limits of automatic translation from Japanese tweets to English and the other way around ;-).

In our daily “work” with our community, we try to be as open and supportive as possible. The core developers communicate directly, and with short response times, with people who have ideas or need help, through Google Groups or GitHub. We maintain a community, especially for contributors, where we discuss future features and announce upcoming changes early so that API contributors can keep their implementations up to date.

——————————
Martin Schönert
Martin is the origin of many fancy ideas in ArangoDB. As chief architect he is responsible for the overall architecture of the system, bringing in his experience from more than 20 years in IT as developer, architect, project manager and entrepreneur.
Martin started his career as a scientist at the technical university of Aachen after earning his degree in Mathematics. Later he worked as head of product development (Team4 Systemhaus), Director of IT (OnVista Technologies) and head of division at Deutsche Post.
Martin has been working with relational and non-relational databases (e.g. a torrid love-hate relationship with the granddaddy of all non-relational databases: Lotus Notes) for the largest part of his professional life.
When no database did what he needed, he wrote his own: one for extremely high update rates and the other for distributed caching.

Frank Celler
Frank is both an entrepreneur and a backend developer, and has been developing mostly-memory databases for two decades. He is the lead developer of ArangoDB and co-founder of triAGENS. Besides that, Frank organizes Cologne’s NoSQL user group and NoSQL conferences, and speaks at developer conferences.
Frank studied in Aachen and London and received a Ph.D. in Mathematics. Prior to founding triAGENS, the company behind ArangoDB, he worked for several German tech companies as a consultant, team lead and developer.
His technical focus is C and C++; recently he gained some experience with Ruby while integrating mruby into ArangoDB.

Resources

- The stable version (1.3 branch) of ArangoDB can be downloaded here.
- ArangoDB on Twitter
- ArangoDB Google Group
- ArangoDB questions on StackOverflow
- Issue Tracker at Github

Related Posts

- On geo-distributed data management — Interview with Adam Abrevaya. October 19, 2013

- On Big Data and NoSQL. Interview with Renat Khasanshyn. October 7, 2013

- On NoSQL. Interview with Rick Cattell. August 19, 2013

- On Big Graph Data. August 6, 2012

Follow ODBMS.org on Twitter: @odbmsorg

##

Oct 19 13

On geo-distributed data management — Interview with Adam Abrevaya.

by Roberto V. Zicari

“Geo-distribution is the ability to distribute a single, logical SQL/ACID database that delivers transactional consistency across multiple datacenters, cloud provider regions, or a hybrid” — Adam Abrevaya.

I have interviewed Adam Abrevaya, Vice President of Engineering, NuoDB.

RVZ

Q1. You just launched NuoDB 2.0, what is special about it?

Adam Abrevaya: NuoDB Blackbirds Release 2.0 demonstrates a strong implementation of the NuoDB vision. It includes over 200 new features and improvements, making it even more stable and reliable than previous versions.
We have improved migration tools, included Java stored procedures, introduced powerful automated administration, and made enhancements to core geo-distribution functionality, among other things.

Q2. You offer a feature called geo-distribution. What is it and why is it useful?

Adam Abrevaya: Geo-distribution is the ability to distribute a single, logical SQL/ACID database that delivers transactional consistency across multiple datacenters, cloud provider regions, or a hybrid.

NuoDB’s geo-distributed data management lets customers build an active/active, highly-responsive database for high availability and low latency. By bringing the database closer to the end user, we can enable faster responses while simultaneously eliminating the time spent on complex tasks like replication, backup and recovery schemes.

One of the most exciting aspects of the Release 2.0 launch was the discussion of a major deployment of NuoDB Geo-Distribution by a customer. We were very excited to include Cameron Weeks, CEO and Co-Founder of Fathom Voice, talking about the challenges his company was facing—both managing his existing business and cost-effectively expanding globally. After a lengthy evaluation of alternative technologies, he found that NuoDB’s distributed database was the only one that met his needs.

Q3. NuoDB falls broadly into the category of NewSQL databases, but you say that you are also a distributed database and that your architecture is fundamentally different than other databases out there. What’s different about it?

Adam Abrevaya: Yes, we are a NewSQL database and we offer the scale-out performance typically associated with NoSQL solutions, while still maintaining the safety and familiarity of SQL and ACID guarantees.

Our architecture, envisioned by renowned data scientist Jim Starkey, is based on what we call “On-demand Replication”. We have an architecture whitepaper (registration required) which provides all the technical differentiators of our approach.

Q4. NuoDB is SQL compliant, and you claim that it scales elastically. But how do you handle complex join operations on data sets that are geographically distributed and at the same time scale (in) (out)?

Adam Abrevaya: NuoDB can have transactions that work against completely different on-demand caches.
For example, you can have OLTP transactions running in 9 Amazon AWS regions, each working on a subset of the overall database. Separately, there can be on-demand caches that can be dedicated to queries across the entire data set. NuoDB manages these on-demand ACID-compliant caches with very different use cases automatically without impact to the critical end user OLTP operations.

Q5. What is special about NuoDB with respect to availability? Several other NoSQL data stores are also resilient to infrastructure and partition failures.

Adam Abrevaya: First off, NuoDB offers a distributed SQL database system that provides all the ACID guarantees you expect from a relational database. We scale out like NoSQL databases and offer support for handling independent failures at each level of our architecture. Redundant processes take over for failed processes (due to machine or other failures), and we make it easy for new machines and processes to be brought online and added to the overall database dynamically. Applications that use the typical facilities for building an enterprise application will automatically reconnect to surviving processes in our system. We can detect network partition failures and allow the application to take appropriate measures.

Q6 How are some of your customers using NuoDB?

Adam Abrevaya: We are seeing a number of common uses of NuoDB among our customers. These range from startups building new web-facing solutions, to geo-distributed SaaS applications, to ISVs moving existing apps to the cloud, to all sorts of other apps that hit the performance wall with MySQL and other traditional DBMSs. Ultimately, with lots of replication, sharding, new server hardware, etc., customers can use traditional databases to scale out or up, but at a very high cost in time and money, and usually by giving up transactional guarantees. One customer said he decided to look at alternatives to MySQL just because he was spending so much time in meetings talking about how to get it to do what they needed it to do. He added up the cost of the man-hours and he said “migrate.”

As I mentioned already, Fathom Voice, a SaaS provider offering VoIP, conference bridging, receptionist services and some innovative communications apps, had a global deployment challenge: how to get the database near their globe-trotting customers, reduce latency and ensure redundancy. They are one of many customers and prospects tackling these issues.

———————-
Adam Abrevaya, Vice President of Engineering, NuoDB
Adam has been building and managing world-class engineering teams and products for almost two decades. His passion is around building and delivering high-performance core infrastructure products that companies depend on to build their businesses.

Adam started his career at MIT Lincoln Laboratory where he developed a distributed platform and image processing algorithms for detecting dangerous weather patterns in radar images. The system was deployed at several airports around the country.

From there, Adam joined Object Design and held various senior management positions where he was responsible for managing several major releases of ObjectStore (an object database), along with spearheading the development team building XML products that included Stylus Studio, an XML database, and a Business Process Manager.

Adam joined Pantero Corporation as VP of Development where he developed a revolutionary Semantic Data Integration product. Pantero was eventually sold to Progress Software.

From Pantero, Adam joined m-Qube to manage and build the team creating its Mobile Messaging Gateway platform. The m-Qube platform is a carrier-grade product that has become the leading Mobile Messaging Gateway in North America and generated billions of dollars in revenue. Adam continued managing the m-Qube platform, along with expanded roles, after acquisitions of the technology by VeriSign and Mobile Messenger.

———

Related Posts

- On Big Data and NoSQL. Interview with Renat Khasanshyn. October 7, 2013

- On NoSQL. Interview with Rick Cattell. August 19, 2013

Resources

- Download NuoDB Pro Edition (Registration required) (NuoDB Blackbirds Release 2.0)

- ODBMS.org free resources on Relational Databases, NewSQL, XML Databases, RDF Data Stores:
Blog Posts | Free Software | Articles and Presentations | Lecture Notes | Tutorials | Journals

Follow ODBMS.org on Twitter: @odbmsorg

##

Oct 7 13

On Big Data and NoSQL. Interview with Renat Khasanshyn.

by Roberto V. Zicari

“The most important thing is to focus on a task you need to solve instead of a technology” –Renat Khasanshyn.

I have interviewed Renat Khasanshyn, Founder and Chairman of Altoros.
Renat is a NoSQL and Big Data specialist.

RVZ

Q1. In your opinion, what are the most popular players in the NoSQL market?

Khasanshyn: I think MongoDB is definitely one of the key players in the NoSQL market. This database has a long history, at least for this kind of product, and good commercial support. For many people this database became the first mass-market NoSQL store. I can assume that MongoDB is going to become for NoSQL something like what MySQL is for relational databases. The second position I would give to Cassandra. It has a great architecture and enables building clusters with geographically dispersed nodes, which seems absolutely amazing to me. In addition, this database is often chosen by big companies that need a large, highly available cluster.

Q2. How do you evaluate and compare different NoSQL Databases?

Khasanshyn: Thank you for an interesting question. How to choose a database? Which one is the best? These are the main questions for any company that wants to try a NoSQL solution. Definitely, for some cases it may be quite easy to select a suitable NoSQL store. However, very often it is not enough just to know the customer’s business goals. When we suggest a particular database we take into consideration the following factors: the business issues a NoSQL store should solve, database reading/writing speed, availability, scalability, and many other important indicators. Sometimes we use a hybrid solution that may include several NoSQL databases.
Or we can see that a relational database will be a good match for a case. The most important thing is to focus on a task you need to solve instead of a technology.

We think that good scalability, performance, and ease of administration are the most important criteria for choosing a NoSQL database. These are the key factors that we take into consideration. Of course, there are some additional criteria that sometimes may be even more important than those mentioned above. To simplify the choice of a database for our engineers and for many other people, we have been carrying out independent tests for two years that evaluate the performance of different NoSQL databases. Although aimed at comparing performance, these investigations also touch on consistency, scalability, and configuration issues. You can take a look at our most recent article on this subject on Network World. Some new research on this subject is to be published in a couple of months.

Q3. Which NoSQL databases did you evaluate so far, and what are the results did you obtain?

Khasanshyn: We have used a great variety of NoSQL databases, for instance Cassandra, HBase, MongoDB, Riak, Couchbase Server, Redis, etc., for our research and real-life projects. From this experience, we learned that one should be very careful when choosing a database. It is better to spend more time on architecture design and make some changes to the project in the beginning rather than run into a serious issue in the future.

Q4. For which projects did use NoSQL databases and for what kind of problems?

Khasanshyn: It is hardly possible to name a project for which a NoSQL database would be useless, except for a blog or a home page. As the main use cases for NoSQL stores I would mention the following tasks:

● collecting and analyzing large volumes of data
● scaling large historical databases
● building interactive applications for which performance and fast response time to users’ actions are crucial

The major “drawback” of NoSQL architecture is the absence of an ACID engine that verifies transactions. It means that financial operations or user registration are better handled by an RDBMS like Oracle or MS SQL. However, the absence of ACID allows for significant acceleration and decentralization of NoSQL databases, which are their major advantages. The bottom line: non-relational databases are much faster in general, and they pay for it with a fraction of the reliability. Is it a good tradeoff? It depends on the task.

Q5. What do you see happening with NoSQL, going forward?

Khasanshyn: It’s quite difficult to make any predictions, but we guess that NoSQL and relational databases will become closer. For instance, NewSQL solutions took good scalability from NoSQL and a query language from the SQL products.
Probably, a kind of standard query language based on SQL or an SQL-like language will soon appear for NoSQL stores. We are also looking forward to improved consistency or, to be more precise, better predictability of NoSQL databases. In other words, NoSQL solutions should become more mature. We will also see some market consolidation: smaller players will form alliances or quit the game, leaders will take a bigger part of the market, and we will most likely see a couple of acquisitions. Overall, it will be easier to work with NoSQL and to choose the right solution out of the available options.

Q6. What do you think is still needed for big data analytics to be really useful for the enterprise?

Khasanshyn: It is all about people and their skills. Storage is relatively cheap and available. The variety of databases is enormous, and it helps solve virtually any task. Hadoop is stable. The majority of software is open source-based, or at least doesn’t cost a fortune. But all these components are useless without data scientists who can do modeling and simulations on the existing data, without engineers who can efficiently employ the toolset, and without managers who understand the outcomes of the data-handling revolution that happened just recently. When we have these three types of people in place, then we can say that enterprises are fully equipped for gaining an edge in big data analytics.

Q7. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey. When you speak with your customers what are the typical use cases and requirements they have?

Khasanshyn: I agree with you. Some customers just make their first steps with Hadoop while others need to know how to optimize their Hadoop-based systems. Unfortunately, the second group of customers is much smaller. I can name the following typical tasks our customers have:

● To process historical data that has been collected over a long period of time. Some time ago, users were unable to process large volumes of unstructured data due to financial and technical limitations. Now Hadoop can do it at a moderate cost and in a reasonable time.

● To develop a system for data analysis based on Hadoop. Once an hour, the system builds patterns of typical user behavior on a Web site. These patterns help to react to users’ actions in real time, for instance allowing an action or temporarily blocking some actions because they are not typical of this user. The data is collected continuously and analyzed at the same time, so the system can rapidly respond to changes in user behavior.

● To optimize data storage. It is interesting that in some cases HDFS can replace a database, especially when the database was used for storing raw input data. Such projects do not need an additional database level.

I should say that our customers have similar requirements. Apart from solving a particular business task, they need a certain level of performance and data consistency.

Q8. In your opinion is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?

Khasanshyn: In a few words, my answer is yes. Some specific features of Hadoop make it possible to prepare data for future analysis quickly and at a moderate cost. In addition, this technology can work with unstructured data. However, I do not think it will happen very soon. There are many OLAP systems and they solve their tasks, doing it better or worse. In general, enterprises are very reluctant to change anything. In addition, replacing the OLAP tools requires additional investments. The good news is that we don’t have to choose one or the other. Hadoop can be used as a pre-processor of the data for OLAP analysis, and analysts can keep working with the familiar tools.

Q9. How do you categorize the various stages of the Hadoop usage in the enterprises?

Khasanshyn: I would name the following stages of Hadoop usage:

1. Development of prototypes to check out whether Hadoop is suitable for their tasks
2. Using Hadoop in combination with other tools for storing and processing data of some business units
3. Implementation of a centralized enterprise data storage system and gradual integration of all business units into it

Q10. Data Warehouse vs Big “Data Lake”. What are the similarities and what are the differences?

Khasanshyn: Even though Big “Data Lake” is a metaphor, I do not really like it.
The name suggests that it is something limited, isolated from other systems. I would rather call this concept a “Big Data Ocean”, because the data added to the system can interact with the surrounding systems. In my opinion, data warehouses are a bit outdated. At an earlier stage, such systems enabled companies to aggregate a lot of data in one central storage and arrange this data, all with acceptable speed. Now there are a lot of cheap storage solutions, so we can keep enormous volumes of data and process them much faster than with data warehouses.

The possibility to store large volumes of data is a feature that data warehouses and a Big “Data Lake” have in common. With a flexible architecture and broad capabilities for data analysis and discovery, a Big “Data Lake” provides a wider range of business opportunities. A modern company should adjust to changes very fast; the structure that was good yesterday may become a limitation today.

Q11. In your opinion, is there a technology which is best suited to build a Big Data Analytics Data Platform? If yes, which one?

Khasanshyn: As I have already said, there is no “magic bullet” that can cure every disease. There is no “Universal Big Data Analytics Data Platform” that fits every case; everything depends on the requirements of a particular business. A system of this kind should have the following features:

● A Hadoop-based system for storing and processing data
● An operational database that contains the most recent data, possibly raw data, that should be analyzed. A NoSQL solution can be used for this case.
● A database for storing predicted indicators. Most probably, it should be a relational data store.
● A system that allows for creating data analysis algorithms. The R language can be used to complete this task.
● A report-building system that provides access to data. For instance, there are good options like Tableau or Pentaho.

Q12. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

Khasanshyn: In my opinion, cloud computing became the force that raised a Big Data wave. Elastic computing enabled us to use the amount of computation resources we need and also reduced the cost of maintaining a large infrastructure.

There is a connection between elastic computing and big data analytics. For us it is quite a typical case that we have to process data from time to time, for instance once a day. To solve this task, we can deploy a new cluster or just scale up an existing Hadoop cluster in the cloud environment. We can temporarily increase the speed of data processing by scaling the Hadoop cluster; the task will be executed faster, and after that we can stop the cluster or reduce its size. I can even say that cloud technologies are a must-have component for a Big Data analysis system.
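
As a hedged sketch of that pattern, the snippet below uses boto3 (the current AWS SDK for Python, which post-dates this interview) to temporarily grow and then shrink an existing Elastic MapReduce cluster; the cluster and instance-group IDs are placeholders, not real resources:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

CLUSTER_ID = "j-XXXXXXXXXXXXX"       # placeholder: an existing EMR cluster
CORE_GROUP_ID = "ig-XXXXXXXXXXXXX"   # placeholder: its core instance group

def resize_core_nodes(count):
    # Scale the core instance group out before the job and back in afterwards.
    emr.modify_instance_groups(
        ClusterId=CLUSTER_ID,
        InstanceGroups=[{"InstanceGroupId": CORE_GROUP_ID, "InstanceCount": count}],
    )

resize_core_nodes(20)   # scale out for the daily processing window
# ... run the Hadoop job ...
resize_core_nodes(4)    # scale back in once the job has finished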

——-

Renat Khasanshyn, Founder and Chairman, Altoros.
Renat is founder & CEO of Altoros, and Venture Partner at Runa Capital. Renat helps define Altoros’s strategic vision and its role in the Big Data, Cloud Computing, and PaaS ecosystem. Renat is a frequent conference and meetup speaker on these topics.
Under his supervision Altoros has been servicing such innovative companies as Couchbase, RightScale, Canonical, DataStax, Joyent, Nephoscale, and NuoDB.
In the past, Renat has been selected as finalist for the Emerging Executive of the Year award by the Massachusetts Technology Leadership Council and once won an IBM Business Mashup Challenge. Prior to founding Altoros, Renat was VP of Engineering for Tampa-based insurance company PriMed. Renat is also founder of Apatar, an open source data integration toolset, founder of Silicon Valley NewSQL User Group and co-founder of the Belarusian Java User Group.

——————–
Related Posts

- On NoSQL. Interview with Rick Cattell. August 19, 2013

- On Oracle NoSQL Database –Interview with Dave Segleau. July 2, 2013

- On Big Data and Hadoop. Interview with Paul C. Zikopoulos. June 10, 2013
————————-

Resources

- Evaluating NoSQL Performance, Sergey Bushik, Altoros (Slideshare)

- “A Vendor-independent Comparison of NoSQL Databases: Cassandra, HBase, MongoDB, Riak”. Altoros (registration required)

- ODBMS.org free resources on Big Data and Analytical Data Platforms:
Blog Posts | Free Software| Articles | Lecture Notes | PhD and Master Thesis|

- ODBMS.org free resources on NoSQL Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis

- ODBMS.org free resources on Cloud Data Stores:
Blog Posts | Lecture Notes| Articles and Presentations| PhD and Master Thesis|

———————–

Follow ODBMS.org on Twitter: @odbmsorg

##

Sep 23 13

Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett.

by Roberto V. Zicari

“The most valuable thing I’ve learned in this role is that judicious use of a little bit of knowledge can go a long way. I’ve seen colleagues and other companies get caught up in the “Big Data” craze by spending hundreds of thousands of pounds sterling on a Hadoop cluster that sees a few megabytes a month. But the most successful initiatives I’ve seen treat it as another tool and keep an eye out for valuable problems that they can solve.” –Matthew Eric Bassett.

I have interviewed Matthew Eric Bassett, Director of Data Science for NBCUniversal International.
NBCUniversal is one of the world’s leading media and entertainment companies in the development, production, and marketing of entertainment, news, and information to a global audience.
RVZ

Q1. What is your current activity at Universal?

Bassett: I’m the Director of Data Science for NBCUniversal International. I lead a small but highly effective predictive analytics team. I’m also a “data evangelist”; I spend quite a bit of my time helping other business units realize they can find business value from sharing and analyzing their data sources.

Q2. Do you use Data Analytics at Universal and for what?

Bassett: We predict key metrics for the different businesses – everything from television ratings, to how an audience will respond to marketing campaigns, to the value of a particular opening weekend for the box office. To do this, we use machine learning regression and classification algorithms, semantic analysis, Monte Carlo methods, and simulations.

Q3. Do you have Big Data at Universal? Could you pls give us some examples of Big Data Use Cases at Universal?

Bassett: We’re not working with terabyte-scale data sources. “Big data” for us often means messy or incomplete data.
For instance, our cinema distribution company operates in dozens of countries. For each day in each one, we need to know how much money was spent and by whom – and feed this information into our machine-learning simulations for future predictions.
Each country might have dozens more cinema operators, all sending data in different formats and at different qualities. One territory may neglect demographics, another might mis-report gross revenue. In order for us to use it, we have to find missing or incorrect data and set the appropriate flags in our models and reports for later.

Automating this process is the bulk of our Big Data operation.
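
A minimal pandas sketch of that kind of automation might look like the following; the column names, sample territories and sanity checks are invented for illustration and are not NBCUniversal’s actual pipeline:

import pandas as pd

feed = pd.DataFrame({
    "territory":     ["DE", "FR", "BR"],
    "gross_revenue": [125000.0, -40.0, None],    # FR mis-reported, BR missing
    "demographics":  ["18-34", None, "25-44"],   # FR neglected demographics
})

feed["flag_missing_revenue"]  = feed["gross_revenue"].isna()
feed["flag_negative_revenue"] = feed["gross_revenue"] < 0
feed["flag_missing_demo"]     = feed["demographics"].isna()

# Rows carrying any flag get routed to manual review and excluded from the models.
needs_review = feed[feed.filter(like="flag_").any(axis=1)]
print(needs_review[["territory", "gross_revenue", "demographics"]])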

Q4. What “value” can be derived by analyzing Big Data at Universal?

Bassett: “Big data” helps everything from marketing, to distribution, to planning.
“In marketing, we know we’re wasting half our money. The problem is that we don’t know which half.” Big data is helping us solve that age-old marketing problem.
We’re able to track how the market is responding to our advertising campaigns over time, compare it to past campaigns and products, and use that information to more precisely reach our audience (a bit like how the Obama campaign was able to use big data to optimize its strategy).

In cinema alone, the opening weekend of a film can affect gross revenue by seven figures (or more), so any insight we can provide into the most optimal time can directly generate thousands or millions of dollars in revenue.

Being able to distill “big data” from historical information, audience responses in social media, data from commercial operators, et cetera, into a usable and interactive simulation completely changes how we plan our strategy for the next 6-15 months.

Q5. What are the main challenges for big data analytics at Universal ?

Bassett: Internationalization, adoption, and speed.
We’re an international operation, so we need to extend our results from one country to another.
Some territories have a high correlation between our data mining operation and the metrics we want to predict. But when we extend to other territories we have several issues.
For instance, 1) it’s not as easy for us to do data mining on unstructured linguistic data (like audience comments on a YouTube preview) and 2) user-generated and web analytics data is harder to find (and in some cases nonexistent!) in some of our markets, even if we did have a multi-language data mining capability. Less reliable regions send us incoming data or historicals that are erroneous, incomplete, or simply not there – see my comment about “messy data”.

Reliability with internationalization feeds into another issue – we’re in an industry that historically uses qualitative rather than quantitative processes. It takes quite a bit of evangelism to convince people what is possible with a bit of statistics and programming, and even after we’ve created a tool for a business, it takes some time for all the key players to trust and use it consistently.

A big part of accomplishing that is ensuring that our simulations and predictions happen fast.
Naturally, our systems need to be able to respond to market changes (a competing film studio changes a release date, an event in the news changes television ratings, et cetera) and inform people what happens.
But we need to give researchers and industry analysts feedback instantly – even while the underlying market is static – to keep them engaged. We’re often asking ourselves questions like “how can we make this report faster” or “how can we speed up this script that pulls audience info from a pdf”.

Q6. How do you handle the Big Data Analytics “process” challenges with deriving insight?
For example when:

  • capturing data
  • aligning data from different sources (e.g., resolving when two objects are the same)
  • transforming the data into a form suitable for analysis
  • modeling it, whether mathematically, or through some form of simulation
  • understanding the output
  • visualizing and sharing the results

Bassett: We start with the insight in mind: What blind-spots do our businesses have, what questions are they trying to answer and how should that answer be presented? Our process begins with the key business leaders and figuring out what problems they have – often when they don’t yet know there’s a problem.

Then we start our feature selection, and identify which sources of data will help achieve our end goal – sometimes a different business unit has it sitting in a silo and we need to convince them to share, sometimes we have to build a system to crawl the web to find and collect it.
Once we have some idea of what we want, we start brainstorming about the right methods and algorithms we should use to reveal useful information: Should we cluster across a multi-variate time series of market response per demographic and use that as an input for a regression model? Can we reliably get a quantitative measure of a demographics engagement from sentiment analysis on comments? This is an iterative process, and we spend quite a bit of time in the “capturing data/transforming the data” step.
But it’s where all the fun is, and it’s not as hard as it sounds: typically, the most basic scientific methods are sufficient to capture 90% of the business value, so long as you can figure out when and where to apply it and where the edge cases lie.
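
To make the “cluster, then regress” idea concrete, here is a hedged scikit-learn sketch on synthetic data; the feature layout and model choices are illustrative assumptions rather than the team’s actual pipeline:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))    # 200 demographics x 12 weekly market-response points
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=200)   # e.g. opening-weekend value

# Cluster the response series, then use the cluster label as an extra regression input.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
features = np.column_stack([X, clusters])
model = LinearRegression().fit(features, y)
print("R^2 on the training data:", round(model.score(features, y), 3))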

Finally, we have another exciting stage: finding surprising insights in the results.
You might start by trying to get a metric for risk in cinema, and in the process find a metric for how the risk changes for releases that target a specific audience – and this new method might work for a different business.

Q7. What kind of data management technologies do you use? What is your experience in using them? Do you handle un-structured data? If yes, how?

Bassett: For our structured, relational data, we make heavy use of MySQL. Despite collecting and analyzing a great deal of un-structured data, we haven’t invested much in a NoSQL or related infrastructure. Rather, we store and organize such data as raw files on Amazon’s S3 – it might be dirty, but we can easily mount and inspect file systems, use our Bash kung-fu, and pass S3 buckets to Hadoop/Elastic MapReduce.

Q8. Do you use Hadoop? If yes, what is your experience with Hadoop so far?

Bassett: Yes, we sometimes use Hadoop for that “learning step” I described earlier, as well as batch jobs for data mining on collected information. However, our experience is limited to Amazon’s Elastic MapReduce, which makes the whole process quite simple – we literally write our map and reduce procedures (in whatever language we chose), tell Amazon where to find the code and the data, and grab some coffee while we wait for the results.
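
For readers unfamiliar with that workflow, here is a hedged sketch of the kind of map and reduce procedures one can hand to Elastic MapReduce via Hadoop streaming; the word-count task, file layout and the use of streaming are assumptions for illustration, not the team’s actual jobs:

# Hadoop-streaming style job: both phases read stdin and write stdout,
# so EMR can run them over input stored in S3.
import sys

def mapper():
    for line in sys.stdin:
        for term in line.strip().split():
            print(term + "\t1")                  # emit (term, 1)

def reducer():
    # Hadoop delivers the mapper output sorted by key.
    current, count = None, 0
    for line in sys.stdin:
        term, value = line.rstrip("\n").split("\t")
        if term != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = term, 0
        count += int(value)
    if current is not None:
        print(current + "\t" + str(count))

if __name__ == "__main__":
    reducer() if "reduce" in sys.argv[1:] else mapper()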

Q9. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?

Bassett: We don’t do any real-time analytics…yet. Thus far, we’ve created a lot of value from simulations that respond to changing marketing information.

Q10 Cloud computing and open source: Do you they play a role at Universal? If yes, how?

Bassett: Yes, cloud computing and open source play a major role in all our projects: our whole operation makes extensive use of Amazon’s EC2 and Elastic MapReduce for simulation and data mining, and S3 for data storage.

We’re big believers in functional programming – many projects start with “experimental programming” in Racket (a dialect of the Lisp programming language) and often stay there into production.

Additionally, we take advantage of the thriving Python community for computational statistics: IPython notebook, NumPy, SciPy, NLTK, et cetera.
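
As a small illustration of that stack, the fragment below tokenises a few audience comments with NLTK and summarises them with NumPy; the comments are invented, and NLTK’s “punkt” tokeniser data must already have been downloaded:

import numpy as np
import nltk   # assumes nltk.download("punkt") has been run once

comments = [
    "Loved the trailer, cannot wait for the opening weekend!",
    "Not my kind of film to be honest.",
    "The cast looks brilliant.",
]

token_counts = np.array([len(nltk.word_tokenize(c)) for c in comments])
print("comments:", len(comments))
print("mean tokens per comment:", token_counts.mean())
print("longest comment (tokens):", int(token_counts.max()))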

Q11 What are the main research challenges ahead? And what are the main business challenges ahead?

Bassett: I alluded to some already previously: collecting and analyzing multi-lingual data, promoting the use of predictive analytics, and making things fast.

Recruiting top talent is frequently a discussion among my colleagues, but we’ve been quite fortunate in this regard. (And we devote a great deal of time to training for machine learning and big data methods.)

Qx Anything else you wish to add?

Bassett: The most valuable thing I’ve learned in this role is that judicious use of a little bit of knowledge can go a long way. I’ve seen colleagues and other companies get caught up in the “Big Data” craze by spending hundreds of thousands of pounds sterling on a Hadoop cluster that sees a few megabytes a month. But the most successful initiatives I’ve seen treat it as another tool and keep an eye out for valuable problems that they can solve.

Thanks!

—–

Matthew Eric Bassett -Director of Data Science, NBCUniversal International
Matthew Eric Bassett is a programmer and mathematician from Colorado and started his career there building web and database applications for public and non-profit clients. He moved to London in 2007 and worked as a consultant for startups and small businesses. In 2011, he joined Universal Pictures to work on a system to quantify risk in the international box office market, which led to his current position leading a predictive analytics “restructuring” of NBCUniversal International.
Matthew holds an MSci in Mathematics and Theoretical Physics from UCL and is currently pursuing a PhD in Noncommutative Geometry from Queen Mary, University of London, where he is discovering interesting, if useless, applications of his field to number theory and machine learning.

Resources

- How Did Big Data Help Obama Campaign? (Video Bloomberg TV)

- Google’s Eric Schmidt Invests in Obama’s Big Data Brains (Bloomberg Businessweek Technology)

- Cloud Data Stores – Lecture Notes: “Data Management in the Cloud”. Michael Grossniklaus, David Maier, Portland State University.
Lecture Notes | Intermediate/Advanced | English | DOWNLOAD ~280 slides (PDF)| 2011-12|

Related Posts

- Big Data from Space: the “Herschel” telescope. August 2, 2013

- Cloud based hotel management– Interview with Keith Gruen July 25, 2013

- On Big Data and Hadoop. Interview with Paul C. Zikopoulos. June 10, 2013

Follow ODBMS.org on Twitter: @odbmsorg

##

Sep 1 13

On Linked Data. Interview with John Goodwin.

by Roberto V. Zicari

“Semantic technologies may be unfamiliar, but when you have used them for a while you will realise they are no harder than many other technologies…in fact I would argue they are easier.”– John Goodwin.

On the topics of Semantic web technologies, ontology engineering, and linked data, I have interviewed John Goodwin. John is Principal Scientist in the Research Department of Ordnance Survey, which is Great Britain’s National Mapping Authority.

RVZ

Q1. You are a senior data scientist at the Ordnance Survey, Great Britain’s national mapping agency. What is your role there?

John Goodwin: I am a Principal Scientist in the Research Department of Ordnance Survey, which is Great Britain’s National Mapping Authority [note: we are authorities now...not agencies].

I have worked in research for Ordnance Survey for over 10 years now, and my research has mainly focused on semantic web technologies, ontology engineering and linked data. The Principal Scientist role is a fairly new one for me, and as part of this role I am now responsible for a stream of research work around data management, data delivery and web services. This involves looking at new and novel technologies that ensure we have the correct infrastructure and data models to meet the challenges of the future. Furthermore, it involves investigating new ways we can serve our data to the end customer.

Q2. Do you have a Big Data problem at the at the Ordnance Survey? Could you please give us some examples of Big Data Use Cases?

John Goodwin: Hmmm, that is debatable. Ordnance Survey certainly has big ‘data problems’, but I don’t know if they qualify as ‘big data’ problems. I have heard Big Data defined as any data that won’t fit into Excel (which is a definition I personally hate), and if that is the case then we certainly have ‘Big Data’. Ordnance Survey currently stores information about half a billion topographic features and 27.5 million geocoded addresses (with around 500,000 changes a year). So we may not have the sheer volume of data that some folk have, but I believe the combination of volume and complexity means that performing analysis over this data or running queries would certainly be a ‘Big Data’ problem.
For example, if you wanted to calculate the number of postboxes in Scotland or find the total length of all roads in Great Britain, you could be waiting some time using traditional database solutions.

Q3. The vision of the Semantic Web is one where web pages contain self-describing data, so that machines will be able to navigate them as easily as humans do now. What are the main benefits? Who could profit most from the Semantic Web?

John Goodwin: I think an immediate benefit is the ability to provide more structured data to search engines so that they can provide better search services. Structured web content means more meaningful search results and offers new ways to summarise and present pages in a search engine.

Q4. Who is currently using Semantic Web technologies and how? Could you please give us some examples of current commercial projects?

John Goodwin: One interesting example is a company called Garlik (now part of Experian). Garlik provides services to protect people from identity theft and financial fraud. They use semantic web technologies to integrate a number of different datasets, and provide a flexible way to integrate new datasets, so they can perform queries across these datasets to find potential victims more easily. The BBC are great users of linked data technology and used triplestore technologies as part of the content management systems for their World Cup and Olympics websites. Again, the flexibility of the technology, and the ability to link data across the whole of the BBC, proved invaluable.

We are using linked data technologies at Ordnance Survey in research projects to look at ways of integrating our data with third-party data.

The major search engines are backing an initiative called schema.org, which provides a unified schema for structured data in web content, and this has the potential (as mentioned above) to provide a richer search experience.

Q5. Do you use Linked Data? What are the main benefits of Linked Data in your opinion?

John Goodwin: I am a big supporter of linked data, and this has been the focus of my work for the last few years. I have used it in research projects and also produced the Ordnance Survey linked data.

Linked data is great for data integration – a common data language makes it easy (or rather easier) to bring a number of disparate datasets together. It is also more flexible than traditional relational database technologies. Like other NoSQL technologies, linked data can be seen as ‘schemaless’ to some extent. This means that if you want to change the data model by, say, adding new attributes or properties, it is very easy to do so. Furthermore, and this is a more personal thing, I find graphs to be the most natural way to think about data. It feels far more intuitive, and I have to say I think querying graph data using SPARQL is far easier than querying relational data using SQL (especially if you have a lot of joins).

Q6. What data management technologies are best suited to model and query Linked Open Data?

John Goodwin: Linked Open Data is built around W3C standards such as RDF (the Resource Description Framework), which is the data language of choice on the linked data web (although some people like to debate whether or not RDF is needed for linked data). RDF is to the web of data as HTML is to the web of documents…or at least that is how I see it. RDF has its own query language called SPARQL. A large number of programming libraries (e.g. Jena) are emerging to handle RDF. Furthermore, RDF can be stored in databases called triplestores, and there are many triplestores to choose from. I am not in a position to advocate one triplestore over another, but there are a large number of great technologies being developed by SMEs and more traditional database vendors alike, as well as a number of open source options. We have experimented with a number of them at Ordnance Survey.
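
As a brief, hedged illustration of RDF and SPARQL from Python, the snippet below builds a tiny in-memory graph with the rdflib library and queries it; the example namespace and triples are invented purely for illustration:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

g.add((EX.Southampton, RDF.type, EX.City))
g.add((EX.Southampton, RDFS.label, Literal("City of Southampton")))

results = g.query("""
    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?city ?label WHERE {
        ?city a ex:City ;
              rdfs:label ?label .
    }
""")
for city, label in results:
    print(city, label)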

Q7. How do you integrate data from different sources that are not in Linked Open Data format (e.g. relational, raw data, etc.)?

John Goodwin: So far by converting the underlying data to RDF. Most relational data is a simple script away from being RDF. Tools do exist to help ‘triplify’ the data, but if I am honest I find that most of the time it is easier to write a quick Python script to do the job.
OpenRefine is a useful tool that lets you clean up CSV data and has a plugin that allows export of the data to RDF. OpenRefine additionally has the benefit of being able to work with reconciliation APIs. If a linked data site offers a reconciliation API, you can use it with OpenRefine to, for example, convert a column of city names or postcodes to URIs in the Ordnance Survey linked data. This is useful when you need to create explicit links to other datasets. For example, if you had a spreadsheet with place names like ‘City of Southampton’, you could use OpenRefine and the Ordnance Survey linked data reconciliation API to turn ‘City of Southampton’ into its URI.
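
Along the lines of that “quick Python script”, here is a hedged sketch that turns rows of a CSV file into RDF triples with rdflib; the file name, column names and example namespace are assumptions made for illustration:

import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/place/")
g = Graph()

with open("places.csv", newline="") as f:          # assumed columns: id,name,postcode
    for row in csv.DictReader(f):
        place = EX[row["id"]]
        g.add((place, RDF.type, EX.Place))
        g.add((place, RDFS.label, Literal(row["name"])))
        g.add((place, EX.postcode, Literal(row["postcode"])))

print(g.serialize(format="turtle"))                # dump the result as Turtle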

Q8. What are the most promising application domains where you can apply RDF triple store technology such as AllegroGraph and Virtuoso?

John Goodwin: I think any domain where you either want to integrate lots of disparate datasets or want a data model that is flexible, and where schema evolution might otherwise be a problem. I think geospatial is a promising domain, as ‘everything happens somewhere’ and location provides a useful integration hub for many datasets. Semantic web technologies have also been used widely in the bioinformatics domain. The BBC are another great use case – they use the technology to integrate data across their whole enterprise. This brings together data from news, radio, sport, television and music and allows new and exciting ways to explore the data.
I dare say it is also a technology that will prove useful/interesting to certain three letter American government agencies that have made the news recently :)

Q9. Do you use Data Analytics at the Ordnance Survey and for what?

John Goodwin: I would say that currently we don’t really – though it depends what you mean by analytics. We are largely concerned with collecting and maintaining data, and then shipping this out as products and services. We have experimented with an IBM® Netezza® appliance to perform queries over our data that would have taken too much time in traditional databases, answering questions such as ‘how many post boxes are there in Great Britain?’.

Q10. Can you do data analytics using Linked Open Data? If yes how?

John Goodwin: I think again it depends what is meant by analytics. Linked data offers a great way to bring lots of datasets together and then, maybe, materialise a view of those integrated datasets that could then be used to perform some analytics. Many people are doing ‘graph analytics’, and given that linked data is a graph I think there is some interesting work to be done in looking at the intersections of graph/network theory and linked data.

Q11. What are the main current obstacles for the adoption of Semantic Web technologies in the Enterprise?

John Goodwin: I think there are two main obstacles. The first is a perception that RDF and linked data are hard, and somehow we need to overcome this perception. Lots of things in the ICT domain are hard…RDBMS is hard, C++ is hard, etc. Semantic technologies may be unfamiliar, but when you have used them for a while you will realise they are no harder than many other technologies…in fact I would argue they are easier. I know a lot of developers who have moved on to using SPARQL and, after a few months using it, find it much easier to understand than SQL. Furthermore, I think it is harder to hire people with expertise in these technologies – there are still more people skilled up in traditional RDBMS and other newer NoSQL technologies like MongoDB.

I think the second obstacle is that semantic web technologies are, obviously, not going to be as mature as a good old relational database. There are some great triplestores out there, and there are enterprises that have successfully incorporated them (the BBC are a great example), but being a relatively new technology, I suspect many enterprises are nervous to invest.

——————-
John Goodwin went to university at Royal Holloway and Bedford New College (University of London – based in Egham, Surrey) and graduated in 1992 with a 1st class honours degree in mathematics. Following that he moved to Cambridge and studied Part III of the Mathematics Tripos at the Department of Applied Maths and Theoretical Physics (University of Cambridge) where he obtained a Certificate of Advanced Study in Mathematics. John then moved to the University of Southampton to start his PhD.
He graduated in 1997 with a PhD in “The Cauchy Problem in Spacetimes with Closed Timelike Curves” (which can very roughly be paraphrased as ‘do timemachines blow up when you turn them on?’). In 1998 John left academia to start work at Ordnance Survey (located at postcode SO16 0AS) as a systems developer. He left Ordnance Survey in 2000 to start work at a small software company called Neusciences, where he gained experience in various A.I. techniques. After just ten months at Neusciences, John returned to Ordnance Survey to work in the research department, where his research concentrated on the semantic web, ontologies and linked data. On the back of this research John produced the current Ordnance Survey linked data. He is currently still at Ordnance Survey, working as a Principal Scientist, where he leads research (at a technical and strategic level) into data management, data delivery and services.
John currently chairs the UK Location Council linked data working group and participates in the UK Government Linked Data working group.

Related Posts

- On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen. May 13, 2013

- Graphs vs. SQL. Interview with Michael Blaha. April 11, 2013

- On Big Graph Data. August 6, 2012

Resources

ODBMS.org: free resources on Graphs and Data Stores
Blog Posts | Free Software | Articles, Papers, Presentations| Tutorials, Lecture Notes

——————————–
Follow ODBMS.org on Twitter: @odbmsorg