ODBMS Industry Watch: Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.

Database Challenges and Innovations. Interview with Jim Starkey (August 31, 2016)

“Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size?” –Jim Starkey.

I have interviewed Jim Starkey, a database legend. Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history.

RVZ

Q1. In your opinion, what are the most significant advances in databases in the last few years?

Jim Starkey: I’d have to say the “atom programming model” where a database is layered on a substrate of peer-to-peer replicating distributed objects rather than disk files. The atom programming model enables scalability, redundancy, high availability, and distribution not available in traditional, disk-based database architectures.

Q2. What was your original motivation to invent the NuoDB Emergent Architecture?

Jim Starkey: It all grew out of a long Sunday morning shower. I knew that the performance limits of single-computer database systems were in sight, so distributing the load was the only possible solution, but existing distributed systems required that a new node copy a complete database or partition before it could do useful work. I started thinking of ways to attack this problem and came up with the idea of peer to peer replicating distributed objects that could be serialized for network delivery and persisted to disk. It was a pretty neat idea. I came out much later with the core architecture nearly complete and very wrinkled (we have an awesome domestic hot water system).

Q3. In your career as an entrepreneur and architect what was the most significant innovation you did?

Jim Starkey: Oh, clearly multi-generational concurrency control (MVCC). The problem I was trying to solve was allowing ad hoc access to a production database for a 4GL product I was working on at the time, but the ramifications go far beyond that. MVCC is the core technology that makes true distributed database systems possible. Transaction serialization is like Newtonian physics – all observers share a single universal reference frame. MVCC is like special relativity, where each observer views the universe from his or her reference frame. The views appear different but are, in fact, consistent.
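
As a rough illustration of the snapshot idea Starkey describes (each transaction reads from its own consistent view while writers create new versions), here is a minimal, didactic Python sketch. It is not how InterBase, Falcon, or NuoDB actually implement MVCC; the names and the version-visibility rule are simplified for clarity.

# Toy MVCC store: every write appends a new version stamped with the writing
# transaction's id; a reader only sees versions created at or before its snapshot.
class MVCCStore:
    def __init__(self):
        self.versions = {}      # key -> list of (txn_id, value)
        self.next_txn = 1

    def begin(self):
        """Start a transaction; its snapshot is the current counter value."""
        txn_id = self.next_txn
        self.next_txn += 1
        return txn_id

    def write(self, txn_id, key, value):
        # Append a new version instead of overwriting the old one.
        self.versions.setdefault(key, []).append((txn_id, value))

    def read(self, txn_id, key):
        # Newest version created by a transaction no later than the snapshot.
        visible = [v for t, v in self.versions.get(key, []) if t <= txn_id]
        return visible[-1] if visible else None

store = MVCCStore()
t1 = store.begin()
store.write(t1, "balance", 100)
t2 = store.begin()                  # t2's snapshot already sees t1's write
store.write(t2, "balance", 80)
print(store.read(t1, "balance"))    # 100 -- t1 cannot see t2's newer version
print(store.read(t2, "balance"))    # 80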

Q4. Proprietary vs. open source software: what are the pros and cons?

Jim Starkey: It’s complicated. I’ve had feet in both camps for 15 years. But let’s draw a distinction between open source and open development. Open development – where anyone can contribute – is pretty good at delivering implementations of established technologies, but it’s very difficult to push the state of the art in that environment. Innovation, in my experience, requires focus, vision, and consistency that are hard to maintain in open development. If you have a controlled development environment, the question of open source versus proprietary is tactics, not philosophy. Yes, there’s an argument that having the source available gives users guarantees they don’t get from proprietary software, but with something as complicated as a database, most users aren’t going to try to master the sources. But having source available lowers the perceived risk of new technologies, which is a big plus.

Q5. You led the Falcon project – a transactional storage engine for the MySQL server – through the acquisition of MySQL by Sun Microsystems. What impact did this project have on the database space?

Jim Starkey: In all honesty, I’d have to say that Falcon’s most important contribution was its competition with InnoDB. In the end, that competition made InnoDB three times faster. Falcon, multi-version in memory using the disk for backfill, was interesting, but no matter how we cut it, it was limited by the performance of the machine it ran on. It was fast, but no single node database can be fast enough.

Q6. What are the most challenging issues in databases right now?

Jim Starkey: I think it’s time to step back and reexamine the assumptions that have accreted around database technology – data model, API, access language, data semantics, and implementation architectures. The “relational model”, for example, is based on what Codd called relations and we call tables, but otherwise has nothing to do with his mathematical model. That model, based on set theory, requires automatic duplicate elimination. To the best of my knowledge, nobody ever implemented Codd’s model, but we still have tables which bear a scary resemblance to decks of punch cards. Are they necessary? Or do they just get in the way?
Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size? SQL has no provision for flexible text search, no provision for multi-column, multi-table search, and no mechanics in the APIs to handle the results if it could do them. And this is just one of a dozen problems that SQL databases can’t handle. It was a really good technical fit for the computers, memory, and disks of the 1980s, but is it the right answer now?

Q7. How do you see the database market evolving?

Jim Starkey: I’m afraid my crystal ball isn’t that good. Blobs, another of my creations, spread throughout the industry in two years. MVCC took 25 years to become ubiquitous. I have a good idea of where I think it should go, but little expectation of how or when it will.

Qx. Anything else you wish to add?

Jim Starkey: Let me say a few things about my current project, AmorphousDB, an implementation of the Amorphous Data Model (meaning, no data model at all). AmorphousDB is my modest effort to question everything about databases.
The best way to think about Amorphous is to envision a relational database and mentally erase the boxes around the tables so all records free float in the same space – including data and metadata. Then, if you’re uncomfortable, add back a “record type” attribute and associated syntactic sugar, so table-type semantics are available, but optional. Then abandon punch card data semantics and view all data as abstract and subject to search. Eliminate the fourteen different types of numbers and strings, leaving simply numbers and strings, but add useful types like URLs, email addresses, and money. Index everything unless told not to. Finally, imagine an API that fits on a single sheet of paper (OK, 9 point font, both sides) and an implementation that can span hundreds of nodes. That’s AmorphousDB.
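
To make that description concrete, here is a purely illustrative Python sketch of the idea: records are free-floating attribute maps, a “record type” is just another optional attribute, and every attribute value is indexed unless told otherwise. This is an assumption-laden toy, not AmorphousDB code.

from collections import defaultdict

# Toy "amorphous" store: no tables, optional record types, index everything.
class AmorphousSketch:
    def __init__(self):
        self.records = []
        self.index = defaultdict(set)    # (attribute, value) -> record ids

    def insert(self, record, index=True):
        rid = len(self.records)
        self.records.append(record)
        if index:
            for attr, value in record.items():
                self.index[(attr, value)].add(rid)
        return rid

    def find(self, **criteria):
        # Search on any attributes; "record_type" is optional syntactic sugar.
        ids = None
        for attr, value in criteria.items():
            matches = self.index[(attr, value)]
            ids = matches if ids is None else ids & matches
        return [self.records[i] for i in sorted(ids or [])]

db = AmorphousSketch()
db.insert({"record_type": "person", "name": "Ada", "email": "ada@example.com"})
db.insert({"name": "Analytical Engine", "kind": "machine"})   # no record type needed
print(db.find(record_type="person"))
print(db.find(name="Analytical Engine"))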

————
Jim Starkey invented the NuoDB Emergent Architecture, and developed the initial implementation of the product. He founded NuoDB [formerly NimbusDB] in 2008, and retired at the end of 2012, shortly before the NuoDB product launch.

Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history, from the Datacomputer project on the fledgling ARPAnet to his most recent startup, NuoDB, Inc. Over that period, he has been responsible for many database innovations, from the date data type to the BLOB to multi-version concurrency control (MVCC). Starkey has extensive experience in proprietary and open source software.

Starkey joined Digital Equipment Corporation in 1975, where he created the Datatrieve family of products, the DEC Standard Relational Interface architecture, and the first of the Rdb products, Rdb/ELN. Starkey was also software architect for DEC’s database machine group.

Leaving DEC in 1984, Starkey founded Interbase Software to develop relational database software for the engineering workstation market. Interbase was a technical leader in the database industry, producing the first commercial implementations of heterogeneous networking, blobs, triggers, two-phase commit, database events, etc. Ashton-Tate acquired Interbase Software in 1991 and was, in turn, acquired by Borland International a few months later. The Interbase database engine was released as open source by Borland in 2000 and became the basis for the Firebird open source database project.

In 2000, Starkey founded Netfrastructure, Inc., to build a unified platform for distributable, high quality Web applications. The Netfrastructure platform included a relational database engine, an integrated search engine, an integrated Java virtual machine, and a high performance page generator.

MySQL AB acquired Netfrastructure, Inc. in 2006 to be the kernel of a wholly owned transactional storage engine for the MySQL server, later known as Falcon. Starkey led the Falcon project through the acquisition of MySQL by Sun Microsystems.

Jim has a degree in Mathematics from the University of Wisconsin.
For amusement, Jim codes on weekends, while sailing, but not while flying his plane.

——————

Resources

NuoDB Emergent Architecture (.PDF)

On Database Resilience. Interview with Seth Proctor, ODBMS Industry Watch, March 17, 2015

Related Posts

– Challenges and Opportunities of The Internet of Things. Interview with Steve Cellini, ODBMS Industry Watch, October 7, 2015

– Hands-On with NuoDB and Docker, BY MJ Michaels, NuoDB. ODBMS.org– OCT 27 2015

– How leading Operational DBMSs rank popularity wise? By Michael Waclawiczek– ODBMS.org · JANUARY 27, 2016

– A Glimpse into U-SQL BY Stephen Dillon, Schneider Electric, ODBMS.org-DECEMBER 7, 2015

– Gartner Magic Quadrant for Operational DBMS 2015

Follow us on Twitter: @odbmsorg

##

LinkedIn China’s New Social Platform Chitu. Interview with Dong Bin (August 4, 2016)

“Complicated queries, like looking for second-degree friends, are really hard for traditional databases.” –Dong Bin

I have interviewed Dong Bin, Engineering Manager at LinkedIn China. The LinkedIn China development team launched a new social platform, known as Chitu, to attract a meaningful segment of the Chinese professional networking market.

RVZ

Q1. What is your role at LinkedIn China?

Dong Bin: I am an Engineering Manager in charge of the backend services for Chitu. The backend includes all of Chitu’s consumer-facing features, like feeds, chat, events, etc.

Q2. You recently launched a new social platform, called Chitu. Which segment of the Chinese professional networking market are you addressing with Chitu? How many users do you currently have?

Dong Bin: Unlike LinkedIn.com, Chitu targets young people without strong backgrounds, mostly working in second-tier cities, who are eager to learn how to advance their careers. For business reasons, we cannot publish member counts yet. Sorry for that.

Q3. What are the main similarities and differences of Chitu with respect to LinkedIn?

Dong Bin: Besides the difference in user targeting, Chitu offers more popular features, like Live Mode and knowledge monetization. The Chitu team also works like a startup, which lets the product move extremely fast. That is the key to beating the local competitors.

Q4. Who are your main competitors in China?

Dong Bin: The main competitors are: Maimai and Liepin.

Q5. What were the main challenges in developing Chitu?

Dong Bin: 1. At the beginning of development, Chitu needed to launch on an almost impossible deadline to catch up with competitors, with a team of fewer than 20 engineers. 2. Many of the hot features proposed were complicated from an implementation perspective, like first-, second- and third-degree friends and real-time chat. They are tough problems for traditional infrastructure.

Q6. Why did you use a graph database for developing Chitu and not a conventional relational database?

Dong Bin: For development efficiency, I needed a schemaless database that can handle relationships very easily. A fixed schema is a pain for fast iteration, because it forces migrations across many environments. And complicated queries, like looking for second-degree friends, are really hard for traditional databases. A graph database just fit my requirements.
I also found that graph databases perform well when querying connected data. With more than 10 years of experience using relational databases, I know that complicated joins are the performance killer; that is exactly where graph databases beat the others.
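
As a hedged illustration of the kind of query Dong Bin mentions, the sketch below finds second-degree friends (friends of friends) with a Cypher query issued through the official Neo4j Python driver. The Person label, FRIEND relationship, connection details, and user id are illustrative assumptions, not Chitu’s actual schema.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Friends of friends who are not already first-degree friends.
SECOND_DEGREE = """
MATCH (me:Person {id: $user_id})-[:FRIEND]-()-[:FRIEND]-(fof:Person)
WHERE fof <> me AND NOT (me)-[:FRIEND]-(fof)
RETURN DISTINCT fof.id AS id, fof.name AS name
"""

def second_degree_friends(user_id):
    with driver.session() as session:
        result = session.run(SECOND_DEGREE, user_id=user_id)
        return [record.data() for record in result]

# In SQL this typically means self-joining a friendship table twice and then
# filtering out first-degree friends, which degrades as the table grows; in a
# graph database it is a short traversal from the starting node.
print(second_degree_friends(42))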

Q7. What are the main advantages did you experience in using Neo4j?

Dong Bin: 1. I decided to use a graph database, and the No. 1 graph database is Neo4j, which left me no other choice; 2. Neo4j has native graph storage; 3. The community is active and the documentation is rich, even comparable to MySQL’s or Oracle’s; 4. It is very fast.

Q8. Did you evaluate other graph databases in the market, other than Neo4j? If yes, which ones?

Dong Bin: Yes, I evaluated OrientDB. I didn’t choose it because 1) it does not have native graph storage, which raised concerns about performance; and 2) the community and the documentation are weaker.

Q9. Can you be a bit more specific, and explain what do you do with the Neo4j native graph storage, and why is it important for your application?

Dong Bin: Because native graph storage can handle queries with joins very quickly, and Chitu has many queries that depend on that. I have first-hand experience of this.

Q10. When you say Neo4j is very fast, did you do any performance benchmarks? If yes, can you share the results? Did you do performance comparisons with other databases?

Dong Bin: We did have some rough benchmarks, but now we focus on production performance metrics. In the production logs, I can see that 99% of the queries need no more than 10ms. This is the data I can provide with confidence.

Q11. What is the roadmap ahead for Chitu?

Dong Bin: The long-term goal is to become the No. 1 professional network platform in China. Chitu will also focus on knowledge sharing and monetization.

———–
Dong Bin is an Engineering Manager at LinkedIn China. He has more than ten years of experience building web and database applications. His main interests are architectures for high performance and high stability. He has several years of database experience with MySQL, Redis and MongoDB, and fell in love with graph databases after learning about Neo4j. Prior to joining LinkedIn, he worked at Kabam as an Engineering Lead developing a mobile strategy game. He obtained an M.S. from the Harbin Institute of Technology in China.

Resources

Chitu: Chitu is a social network app created by LinkedIn China.

– Neo4j Graph Database Helps LinkedIn China Launch Separate Professional Social Networking App

– Graph Databases for Beginners: Native vs. Non-Native Graph Technology

– Graph Databases, by Ian Robinson, Jim Webber, and Emil Eifrem. Published by O’Reilly Media, Inc. Second edition (224 pages).

Related Posts

– The Panama Papers: Why It Couldn’t Have Happened Ten Years Ago By Emil Eifrem, CEO, Neo Technology, ODBMS.org April 6, 2016

– Forrester Report: Graph Databases Market Overview, ODBMS.org,  AUGUST 31, 2015

– Embracing the evolution of Graphs. by Stephen Dillon, Data Architect, Schneider Electric. ODBMS.org, January 2015.

– Graph Databases for Beginners: Why Data Relationships Matter. By Bryce Merkl Sasaki, ODBMS.org, July 31, 2015

– Graph Databases for Beginners: The Basics of Data Modeling. By Bryce Merkl Sasaki, ODBMS.org, August 7, 2015

– Graph Databases for Beginners: Why a Database Query Language Matters. By Bryce Merkl Sasaki, ODBMS.org, August 21, 2015

Follow us on Twitter: @odbmsorg

##

On PostgreSQL. Interview with Bruce Momjian (June 17, 2014)

“There are four things that motivate open source development teams:
1. The challenge/puzzle of programming, 2. Need for the software, 3. Personal advancement, 4. Belief in open source”
— Bruce Momjian.

On PostgreSQL and the challenges of motivating and managing open source teams, I have interviewed Bruce Momjian, Senior Database Architect at EnterpriseDB, and Co-founder of the PostgreSQL Global Development Group and Core Contributor.

RVZ

Q1. How did you manage to transform PostgreSQL from an abandoned academic project into a commercially viable, now enterprise relational database?

Bruce Momjian: Ever since I was a developer of database applications, I have been interested in how SQL databases work internally. In 1996, Postgres was the only open source database available that was like the commercial ones I used at work. And I could look at the Postgres code and see how SQL was processed.

At the time, I also kind of had a boring job, or at least a non-challenging one, writing database reports and applications. So getting to know Postgres was exciting and helped me grow as a developer. I started getting more involved with Postgres. I took over the Postgres website from the university, reading bug reports and fixes, and I started interacting with other developers through the website. Fortunately, I got to know other developers who also found Postgres attractive and exciting, and together we assembled a group of people with similar interests.
We had a small user community at the time, but found enough developers to keep the database feature set moving forward.

As we got more users, we got more developers. Then commercial support opportunities began to grow, helping foster a rich ecosystem of users, developers and support companies that created a self-reinforcing structure that in turn continued to drive growth.

You look at Postgres now and it looks like there was some grand plan. But in fact, it was just a matter of setting up some structure and continuing to make sure all the aspects efficiently reinforced each other.

Q2. What are the current activities and projects of the PostgreSQL community Global Development Group?

Bruce Momjian: We are working on finalizing the version 9.4 beta, which will feature greatly improved JSON capabilities and lots of other stuff. We hope to release 9.4 in September/October of this year.

Q3. How do you manage motivating and managing open source teams?

Bruce Momjian: There are four things that motivate open source development teams:

* The challenge/puzzle of programming
* Need for the software
* Personal advancement
* Belief in open source

Our developers are motivated by a combination of these. Plus, our community members are very supportive of one another, meaning that working on Postgres is seen as something that gives contributors a sense of purpose and value.

I couldn’t tell you the mix of motivations that drive any one individual and I doubt many people you were to ask could answer the question simply. But it’s clear that a mixture of these motivations really drives everything we do, even if we can’t articulate exactly which are most important at any one time.

Q4. What is the technical roadmap for PostgreSQL?

Bruce Momjian: We are continuing to work on handling NoSQL-like workloads better. Our plans for expanded JSON support in 9.4 are part of that.
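
As a small, hedged sketch of the NoSQL-like workload described here, the snippet below stores and queries JSON documents using the jsonb type and GIN indexing added in PostgreSQL 9.4, via psycopg2. The table, columns, and connection string are hypothetical; the ->> and @> operators are standard jsonb features.

import json
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

# A single jsonb column holds schemaless documents; a GIN index speeds up
# containment queries against them.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id  serial PRIMARY KEY,
        doc jsonb NOT NULL
    );
    CREATE INDEX events_doc_idx ON events USING gin (doc);
""")

cur.execute("INSERT INTO events (doc) VALUES (%s)",
            [json.dumps({"type": "login", "user": "ada", "device": "mobile"})])

# Containment query: find documents that include a given key/value pair.
cur.execute("SELECT doc ->> 'user' FROM events WHERE doc @> %s",
            [json.dumps({"type": "login"})])
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()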

We need greater parallelism, particularly to allow a single query to make better use of all the server’s resources.
We made some small steps in this area in 9.3 and plan to make more in 9.4, but there is still much work to be done.

We are also focused on data federation, allowing Postgres to access data from other data sources.
We already support many data interfaces, but they need improvement, and we need to add the ability to push joins and aggregates to foreign data sources, where applicable.

I should add that Evan Quinn, an analyst at Enterprise Management Associates, wrote a terrific research report, PostgreSQL: The Quiet Giant of Enterprise Database, and included some information on plans for 9.4.

Q5. How do managers and executives view Postgres, and particularly how they make Postgres deployment decisions?

Bruce Momjian: In the early years, users that had little or no money and organizations with heavy data needs and small profit margins drove the adoption of Postgres. These users were willing to overlook Postgres’ limitations.

Now that Postgres has filled out its feature set, almost every user segment using relational databases is considering Postgres. For management, cost savings are key. For engineers, it’s Postgres’ technology, ease of use and flexibility.

Q6. What is your role at EnterpriseDB?

Bruce Momjian: My primary responsibility at EnterpriseDB is to help the Postgres community.
EnterpriseDB supports my work as a core team member and I play an active role in the overall decision-making and the organization of community initiatives. I also travel frequently to conferences worldwide, delivering presentations on advances in Postgres and leading Postgres training sessions. At EnterpriseDB, I occasionally do trainings, help with tech support, attend conferences as a Postgres ambassador and visit customers. And of course, do some PR and interviews, like this one.

Q7. Has PostgreSQL still something in common with the original Ingres project at the University of California, Berkeley?

Bruce Momjian: Not really. There is no Ingres code in Postgres though I think the psql terminal SQL tool is similar to the one in Ingres.

Q8. If you had to compare PostgreSQL with MySQL and MariaDB, what would be the differentiators?

Bruce Momjian: The original focus of MySQL was simple read-only queries. While it has improved since then, it has struggled to go beyond that. Postgres has always targeted the middle-level SQL workload, and is now targeting high-end usage, and the simple-usage cases of NoSQL.

MySQL certainly has greater adoption and application support. But in almost every other measure, Postgres is a better choice for most users. The good news is that people are finally starting to realize that.

Q9. How do you see the database market evolving? And how do you position PostgreSQL in such database market?

Bruce Momjian: We are in close communication with our user community, with our developers reading and responding to email requests daily. That keeps our focus on users’ needs. Postgres, being an object-relational, extensible database, is well suited to being expanded to meet changing user workloads. I don’t think any other database has that level of flexibility and strong developer/user community interaction.

Qx. Would you like to add something?

Bruce Momjian: We had a strong PG NYC conference recently. I posted a summary about it here.

I think that conference highlights some significant trends for Postgres in the months ahead.

————-
In 1996, Bruce Momjian co-founded the PostgreSQL Global Development Group, the organization of volunteers that steers the development and release of the PostgreSQL open source database. Bruce played a key role in organizing like-minded database professionals to shepherd PostgreSQL from an abandoned academic project into a commercially viable, now enterprise-class relational database. He dedicates the bulk of his time organizing, educating and evangelizing within the open source database community while acting as Senior Database Architect for EnterpriseDB. Bruce began his career as a high school math and computer science teacher and still serves as an adjunct professor at Drexel University. After leaving high school education, Bruce worked for more than a decade as a database consultant building specialized applications for law firms. He then went on to work for the PostgreSQL community with the support of several private companies before joining EnterpriseDB in 2006 to continue his work in the community. Bruce holds a master’s degree from Arcadia University and earned his bachelor’s degree at Columbia University.

Related Posts

On PostgreSQL. Interview with Tom Kincaid. ODBMS Industry Watch, May 30, 2013

MySQL-State of the Union. Interview with Tomas Ulin. ODBMS Industry Watch, February 11, 2013

Resources

The PostgreSQL Global Development Group

EnterpriseDB Postgres Plus Advanced Server

MariaDB vs MySQL, Daniel Bartholomew, Sr. Technical Writer, Monty Program

Evaluating the energy efficiency of OLTP operations: A case study on PostgreSQL.

Follow ODBMS.org on Twitter: @odbmsorg

##

Big Data: Three questions to Aerospike (March 2, 2014)

“Many tools now exist to run database software without installing software. From vagrant boxes, to one click cloud install, to a cloud service that doesn’t require any installation, developer ease of use has always been a path to storage platform success.”–Brian Bulkowski.

The fifth interview in the “Big Data: three questions to…” series is with Brian Bulkowski, Aerospike co-founder and CTO.

RVZ

Q1. What is your current product offering?

Brian Bulkowski: Aerospike is the first in-memory NoSQL database optimized for flash or solid state drives (SSDs).
In-memory for speed and NoSQL for scale. Our approach to memory is unique – we have built our own file system to access flash, we store indexes in DRAM and you can configure data sets to be in a combination of DRAM or flash. This gives you close to DRAM speeds, the persistence of rotational drives and the price performance of flash.
As next gen apps scale up beyond enterprise scale to “global scale”, managing billions of rows, terabytes of data and processing from 20k to 2 million read/write transactions per second, scaling costs are an important consideration. Servers, DRAM, power and operations – the costs add up, so even developers with small initial deployments must architect their systems with the bottom line in mind and take advantage of flash.
Aerospike is an operational database, a fast key-value store with ACID properties – immediate consistency for single row reads and writes, plus secondary indexes and user defined functions. Values can be simple strings, ints, blobs as well as lists and maps.
Queries are distributed and processed in parallel across the cluster and results on each node can be filtered, transformed, aggregated via user defined functions. This enables developers to enhance key value workloads with a few queries and some in-database processing.
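
As a minimal, hedged sketch of the key-value usage described above, here is a put/get round trip with the Aerospike Python client. The host, namespace (“test”), set name, and bin contents are illustrative defaults, not a recommended production layout.

import aerospike

config = {"hosts": [("127.0.0.1", 3000)]}
client = aerospike.client(config).connect()

key = ("test", "profiles", "user:42")          # (namespace, set, primary key)
client.put(key, {"segments": ["sports", "tech"], "last_seen": 1409500000})

# Single-record reads and writes are immediately consistent.
(_, meta, bins) = client.get(key)
print(meta, bins)

client.close()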

Q2. Who are your current customers and how do they typically use your products?

Brian Bulkowski: We see two use cases – one as an edge database or real-time context store (user profile store, cookie store) and another as a very cost-effective and reliable cache in front of a relational database like MySQL or DB2.

Our customers are some of the biggest names in real-time bidding, cross channel (display, mobile, video, social, gaming) advertising and digital marketing, including AppNexus, BlueKai, TheTradeDesk and [X+1]. These companies use Aerospike to store real-time user profile information like cookies, device-ids, IP addresses, clickstreams, combined with behavioral segment data calculated using analytics platforms and models run in Hadoop or data warehouses. They choose Aerospike for predictable high performance, where reads and writes consistently, meaning 99% of the time, complete within 2-3 milliseconds.

The second set of customers use us in front of an existing database for more cost-effective and reliable caching. In addition to predictable high performance they don’t want to shard Redis, and they need persistence, high availability and reliability. Some need rack-awareness and cross data center support and they all want to take advantage of Aerospike deployments that are both simpler to manage and more cost-effective than alternative NoSQL databases, in-memory databases and caching technologies.

Q3. What are the main new technical features you are currently working on and why?

Brian Bulkowski: We are focused on ease of use, making development easier – quickly writing powerful, scalable applications – with developer tools and connectors. In our Aerospike 3 offering, we launched indexes and distributed queries, user defined functions for in-database processing, expressive API support, and aggregation queries. Performance continues to improve, with support for today’s highly parallel CPUs, higher density flash arrays, and improved allocators for RAM based in-memory use cases.

Developers love Aerospike because it’s easy to run a service operationally. That scale comes after the developer builds their original applications, so developers want samples and connectors that are tested and work easily. Whether it’s an ETL loader for CSV and JSON that’s parallel and scalable, a Hadoop connector that pours insights directly into Aerospike to drive hot interface changes, an improved Mac OS X client, or HTTP/REST interfaces, developers need the ability to write their core application code to use Aerospike easily.

Many tools now exist to run database software without installing software. From vagrant boxes, to one click cloud install, to a cloud service that doesn’t require any installation, developer ease of use has always been a path to storage platform success.

Related Posts

Big Data: Three questions to McObject, ODBMS Industry Watch, February 14, 2014

Big Data: Three questions to VoltDB. ODBMS Industry Watch, February 6, 2014.

Big Data: Three questions to Pivotal. ODBMS Industry Watch, January 20, 2014.

Big Data: Three questions to InterSystems. ODBMS Industry Watch, January 13, 2014.

Operational Database Management Systems. Interview with Nick Heudecker, ODBMS Industry Watch, December 16, 2013.

Resources

Gartner – Magic Quadrant for Operational Database Management Systems (Access the report via registration). Authors: Donald Feinberg, Merv Adrian, Nick Heudecker, Date Published: 21 October 2013.

ODBMS.org free resources on NoSQL Data Stores
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis.

Follow ODBMS.org on Twitter: @odbmsorg

    ##

Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett (September 23, 2013)

“The most valuable thing I’ve learned in this role is that judicious use of a little bit of knowledge can go a long way. I’ve seen colleagues and other companies get caught up in the “Big Data” craze by spending hundreds of thousands of pounds sterling on a Hadoop cluster that sees a few megabytes a month. But the most successful initiatives I’ve seen treat it as another tool and keep an eye out for valuable problems that they can solve.” –Matthew Eric Bassett.

    I have interviewed Matthew Eric Bassett, Director of Data Science for NBCUniversal International.
    NBCUniversal is one of the world’s leading media and entertainment companies in the development, production, and marketing of entertainment, news, and information to a global audience.
    RVZ

    Q1. What is your current activity at Universal?

    Bassett: I’m the Director of Data Science for NBCUniversal International. I lead a small but highly effective predictive analytics team. I’m also a “data evangelist”; I spend quite a bit of my time helping other business units realize they can find business value from sharing and analyzing their data sources.

    Q2. Do you use Data Analytics at Universal and for what?

Bassett: We predict key metrics for the different businesses – everything from television ratings, to how an audience will respond to marketing campaigns, to the value of a particular opening weekend for the box office. To do this, we use machine learning regression and classification algorithms, semantic analysis, Monte Carlo methods, and simulations.
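
As a simplified, hedged sketch of the Monte Carlo side of this work, the snippet below simulates many possible opening weekends from assumed distributions and summarises the spread. The distributions and all the numbers are invented for illustration; they are not NBCUniversal’s models.

import numpy as np

rng = np.random.default_rng(seed=7)
n_trials = 100_000

# Assumed inputs: admissions follow a lognormal distribution, ticket prices
# are roughly normal, and a rival release (30% chance) depresses admissions.
admissions   = rng.lognormal(mean=np.log(800_000), sigma=0.35, size=n_trials)
ticket_price = rng.normal(loc=9.5, scale=0.8, size=n_trials)
competition  = rng.random(n_trials) < 0.3
admissions  *= np.where(competition, 0.85, 1.0)

revenue = admissions * ticket_price

print(f"median opening weekend: ${np.median(revenue) / 1e6:.1f}M")
print(f"5th-95th percentile:    ${np.percentile(revenue, 5) / 1e6:.1f}M"
      f" to ${np.percentile(revenue, 95) / 1e6:.1f}M")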

    Q3. Do you have Big Data at Universal? Could you pls give us some examples of Big Data Use Cases at Universal?

    Bassett: We’re not working with terabyte-scale data sources. “Big data” for us often means messy or incomplete data.
For instance, our cinema distribution company operates in dozens of countries. For each day in each one, we need to know how much money was spent and by whom, and feed this information into our machine-learning simulations for future predictions.
    Each country might have dozens more cinema operators, all sending data in different formats and at different qualities. One territory may neglect demographics, another might mis-report gross revenue. In order for us to use it, we have to find missing or incorrect data and set the appropriate flags in our models and reports for later.

    Automating this process is the bulk of our Big Data operation.
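
As an illustrative sketch of that “messy data” step, the pandas snippet below normalises per-territory daily returns and flags records that are missing fields or look mis-reported, so downstream models can treat them accordingly. Column names, values, and thresholds are hypothetical.

import pandas as pd

raw = pd.DataFrame([
    {"territory": "DE", "date": "2013-09-20", "gross": 120000.0, "admissions": 15000},
    {"territory": "FR", "date": "2013-09-20", "gross": None,     "admissions": 9000},
    {"territory": "IT", "date": "2013-09-20", "gross": 300.0,    "admissions": 12000},
])

df = raw.copy()
df["date"] = pd.to_datetime(df["date"])

# Flag missing or implausible values instead of silently dropping them.
df["flag_missing_gross"] = df["gross"].isna()
df["implied_ticket_price"] = df["gross"] / df["admissions"]
df["flag_suspect_gross"] = df["implied_ticket_price"].lt(1.0).fillna(False)

print(df[["territory", "flag_missing_gross", "flag_suspect_gross"]])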

    Q4. What “value” can be derived by analyzing Big Data at Universal?

    Bassett: “Big data” helps everything from marketing, to distribution, to planning.
    “In marketing, we know we’re wasting half our money. The problem is that we don’t know which half.” Big data is helping us solve that age-old marketing problem.
We’re able to track how the market is responding to our advertising campaigns over time, and compare it to past campaigns and products, and use that information to more precisely reach our audience (a bit like how the Obama campaign was able to use big data to optimize its strategy).

    In cinema alone, the opening weekend of a film can affect gross revenue by seven figures (or more), so any insight we can provide into the most optimal time can directly generate thousands or millions of dollars in revenue.

    Being able to distill “big data” from historical information, audiences responses in social media, data from commercial operators, et cetera, into a useable and interactive simulation completely changes how we plan our strategy for the next 6-15 months.

    Q5. What are the main challenges for big data analytics at Universal ?

    Bassett: Internationalization, adoption, and speed.
    We’re an international operation, so we need to extend our results from one country to another.
    Some territories have a high correlation between our data mining operation and the metrics we want to predict. But when we extend to other territories we have several issues.
For instance, 1) it’s not as easy for us to do data mining on unstructured linguistic data (like audience comments on a YouTube preview) and 2) user-generated and web analytics data is harder to find (and in some cases nonexistent!) in some of our markets, even if we did have a multi-language data mining capability. Less reliable regions send us incoming data or historicals that are erroneous, incomplete, or simply not there – see my comment about “messy data”.

Reliability with internationalization feeds into another issue – we’re in an industry that historically uses qualitative and not quantitative processes. It takes quite a bit of “evangelism” to convince people what is possible with a bit of statistics and programming, and even after we’ve created a tool for a business, it takes some time for all the key players to trust and use it consistently.

    A big part of accomplishing that is ensuring that our simulations and predictions happen fast.
    Naturally, our systems need to be able to respond to market changes (a competing film studio changes a release date, an event in the news changes television ratings, et cetera) and inform people what happens.
    But we need to give researchers and industry analysts feedback instantly – even while the underlying market is static – to keep them engaged. We’re often asking ourselves questions like “how can we make this report faster” or “how can we speed up this script that pulls audience info from a pdf”.

    Q6. How do you handle the Big Data Analytics “process” challenges with deriving insight?
    For example when:

- capturing data
- aligning data from different sources (e.g., resolving when two objects are the same)
- transforming the data into a form suitable for analysis
- modeling it, whether mathematically, or through some form of simulation
- understanding the output
- visualizing and sharing the results

    Bassett: We start with the insight in mind: What blind-spots do our businesses have, what questions are they trying to answer and how should that answer be presented? Our process begins with the key business leaders and figuring out what problems they have – often when they don’t yet know there’s a problem.

    Then we start our feature selection, and identify which sources of data will help achieve our end goal – sometimes a different business unit has it sitting in a silo and we need to convince them to share, sometimes we have to build a system to crawl the web to find and collect it.
Once we have some idea of what we want, we start brainstorming about the right methods and algorithms we should use to reveal useful information: Should we cluster across a multi-variate time series of market response per demographic and use that as an input for a regression model? Can we reliably get a quantitative measure of a demographic’s engagement from sentiment analysis on comments? This is an iterative process, and we spend quite a bit of time in the “capturing data/transforming the data” step.
    But it’s where all the fun is, and it’s not as hard as it sounds: typically, the most basic scientific methods are sufficient to capture 90% of the business value, so long as you can figure out when and where to apply it and where the edge cases lie.

Finally, we have another exciting stage: finding surprising insights in the results.
    You might start by trying to get a metric for risk in cinema, and you might find a metric for how the risk changes for releases that target a specific audience in the process – and this new method might work for a different business.

    Q7. What kind of data management technologies do you use? What is your experience in using them? Do you handle un-structured data? If yes, how?

    Bassett: For our structured, relational data, we make heavy use of MySQL. Despite collecting and analyzing a great deal of un-structured data, we haven’t invested much in a NoSQL or related infrastructure. Rather, we store and organize such data as raw files on Amazon’s S3 – it might be dirty, but we can easily mount and inspect file systems, use our Bash kung-fu, and pass S3 buckets to Hadoop/Elastic MapReduce.

    Q8. Do you use Hadoop? If yes, what is your experience with Hadoop so far?

    Bassett: Yes, we sometimes use Hadoop for that “learning step” I described earlier, as well as batch jobs for data mining on collected information. However, our experience is limited to Amazon’s Elastic MapReduce, which makes the whole process quite simple – we literally write our map and reduce procedures (in whatever language we chose), tell Amazon where to find the code and the data, and grab some coffee while we wait for the results.
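
As a minimal sketch of the “write your map and reduce procedures” workflow described here, the script below follows the Hadoop Streaming convention that Elastic MapReduce accepts: the mapper and reducer read stdin and emit tab-separated key/value lines. The input format (territory,gross per line) is invented for illustration.

import sys

def mapper():
    # Emit territory -> gross for each input record.
    for line in sys.stdin:
        territory, gross = line.strip().split(",")
        print(f"{territory}\t{gross}")

def reducer():
    # Hadoop delivers mapper output sorted by key, so we can sum run by run.
    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0.0
        total += float(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__":
    # Run as the mapper or reducer step of a streaming job, e.g.
    # `python job.py map` or `python job.py reduce`.
    mapper() if sys.argv[1:] == ["map"] else reducer()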

    Q9. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?

    Bassett: We don’t do any real-time analytics…yet. Thus far, we’ve created a lot of value from simulations that responds to changing marketing information.

    Q10 Cloud computing and open source: Do you they play a role at Universal? If yes, how?

    Bassett: Yes, cloud computing and open source play a major role in all our projects: our whole operation makes extensive use of Amazon’s EC2 and Elastic MapReduce for simulation and data mining, and S3 for data storage.

    We’re big believers in functional programming – many projects start with “experimental programming” in Racket (a dialect of the Lisp programming
    language) and often stay there into production.

Additionally, we take advantage of the thriving Python community for computational statistics: IPython notebook, NumPy, SciPy, NLTK, et cetera.

    Q11 What are the main research challenges ahead? And what are the main business challenges ahead?

Bassett: I alluded to some of these already: collecting and analyzing multi-lingual data, promoting the use of predictive analytics, and making things fast.

Recruiting top talent is frequently a discussion among my colleagues, but we’ve been quite fortunate in this regard. (And we devote a great deal of time to training for machine learning and big data methods.)

    Qx Anything else you wish to add?

Bassett: The most valuable thing I’ve learned in this role is that judicious use of a little bit of knowledge can go a long way. I’ve seen colleagues and other companies get caught up in the “Big Data” craze by spending hundreds of thousands of pounds sterling on a Hadoop cluster that sees a few megabytes a month. But the most successful initiatives I’ve seen treat it as another tool and keep an eye out for valuable problems that they can solve.

    Thanks!

    —–

Matthew Eric Bassett – Director of Data Science, NBCUniversal International
    Matthew Eric Bassett is a programmer and mathematician from Colorado and started his career there building web and database applications for public and non-profit clients. He moved to London in 2007 and worked as a consultant for startups and small businesses. In 2011, he joined Universal Pictures to work on a system to quantify risk in the international box office market, which led to his current position leading a predictive analytics “restructuring” of NBCUniversal International.
    Matthew holds an MSci in Mathematics and Theoretical Physics from UCL and is currently pursuing a PhD in Noncommutative Geometry from Queen Mary, University of London, where he is discovering interesting, if useless, applications of his field to number theory and machine learning.

    Resources

    How Did Big Data Help Obama Campaign? (Video Bloomberg TV)

    Google’s Eric Schmidt Invests in Obama’s Big Data Brains (Bloomberg Businessweek Technology)

    Cloud Data Stores – Lecture Notes: “Data Management in the Cloud”. Michael Grossniklaus, David Maier, Portland State University.
    Lecture Notes | Intermediate/Advanced | English | DOWNLOAD ~280 slides (PDF)| 2011-12|

    Related Posts

    Big Data from Space: the “Herschel” telescope. August 2, 2013

    Cloud based hotel management– Interview with Keith Gruen July 25, 2013

    On Big Data and Hadoop. Interview with Paul C. Zikopoulos. June 10, 2013

    Follow ODBMS.org on Twitter: @odbmsorg

    ##

Big Data from Space: the “Herschel” telescope (August 2, 2013)

“One of the biggest challenges with any project of such a long duration is coping with change. There are many aspects to coping with change, including changes in requirements, changes in technology, vendor stability, changes in staffing and so on.” –Jon Brumfitt.

On May 14, 2009, the European Space Agency launched an Ariane 5 rocket carrying the largest telescope ever flown: the “Herschel” telescope, 3.5 meters in diameter.

    I first did an interview with Dr. Jon Brumfitt, System Architect & System Engineer of Herschel Scientific Ground Segment, at the European Space Agency in March 2011. You can read that interview here.

    Two years later, I wanted to know the status of the project. This is a follow up interview.

    RVZ

    Q1. What is the status of the mission?

    Jon Brumfitt: The operational phase of the Herschel mission came to an end on 29th April 2013, when the super-fluid helium used to cool the instruments was finally exhausted. By operating in the far infra-red, Herschel has been able to see cold objects that are invisible to normal telescopes.
    However, this requires that the detectors are cooled to an even lower temperature. The helium cools the instruments down to 1.7K (about -271 Celsius). Individual detectors are then cooled down further to about 0.3K. This is very close to absolute zero, which is the coldest possible temperature. The exhaustion of the helium marks the end of new observations, but it is by no means the end of the mission.
    We still have a lot of work to do in getting the best results from the data processing to give astronomers a final legacy archive of high-quality data to work with for years to come.

The spacecraft has been in orbit around a point known as the second Lagrangian point “L2”, which is about 1.5 million kilometres from Earth (around four times as far away as the Moon). This location provided a good thermal environment and a relatively unrestricted view of the sky. The spacecraft cannot be left in this orbit because regular correction manoeuvres would be needed. Consequently, it is being transferred into a “parking” orbit around the Sun.

    Q2. What are the main results obtained so far by using the “Herschel” telescope?

    Jon Brumfitt: That is a difficult one to answer in a few sentences. Just to take a few examples, Herschel has given us new insights into the way that stars form and the history of star formation and galaxy evolution since the big-bang.
    It has discovered large quantities of cold water vapour in the dusty disk surrounding a young star, which suggests the possibility of other water covered planets. It has also given us new evidence for the origins of water on Earth.
    The following are some links giving more detailed highlights from the mission:

    – Press
    – Results
    – Press Releases
    – Latest news

    With its 3.5 metre diameter mirror, Herschel is the largest space telescope ever launched. The large mirror not only gives it a high sensitivity but also allows us to observe the sky with a high spatial resolution. So in a sense every observation we make is showing us something we have never seen before. We have performed around 35,000 science observations, which have already resulted in over 600 papers being published in scientific journals. There are many years of work ahead for astronomers in interpreting the results, which will undoubtedly lead to many new discoveries.

    Q3. How much data did you receive and process so far? Could you give us some up to date information?

    Jon Brumfitt: We have about 3 TB of data in the Versant database, most of which is raw data from the spacecraft. The data received each day is processed by our data processing pipeline and the resulting data products, such as images and spectra, are placed in an archive for access by astronomers.
    Each time we make a major new release of the software (roughly every six months at this stage), with improvements to the data processing, we reprocess everything.
    The data processing runs on a grid with around 35 nodes, each with typically 8 cores and between 16 and 256 GB of memory. This is able to process around 40 days worth of data per day, so it is possible to reprocess everything in a few weeks. The data in the archive is stored as FITS files (a standard format for astronomical data).
    The archive uses a relational (PostgreSQL) database to catalogue the data and allow queries to find relevant data. This relational database is only about 60 GB, whereas the product files account for about 60 TB.
    This may reduce somewhat for the final archive, once we have cleaned it up by removing the results of earlier processing runs.
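
As an illustrative sketch of the archive pattern described above (a relational catalogue of metadata pointing at FITS product files), the snippet below queries a catalogue table with psycopg2 and opens the resulting files with astropy. The table, columns, connection string, and observation id are hypothetical, not the actual Herschel archive schema.

import psycopg2
from astropy.io import fits

conn = psycopg2.connect("dbname=archive user=reader")
cur = conn.cursor()

# Find the catalogued products for one observation.
cur.execute(
    "SELECT file_path FROM products WHERE obs_id = %s AND product_type = %s",
    (1342231345, "level2_map"),
)

for (file_path,) in cur.fetchall():
    # The products themselves live outside the database as FITS files.
    with fits.open(file_path) as hdul:
        header = hdul[0].header
        print(file_path, header.get("INSTRUME"), header.get("OBJECT"))

cur.close()
conn.close()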

    Q4. What are the main technical challenges in the data management part of this mission and how did you solve them?

    Jon Brumfitt: One of the biggest challenges with any project of such a long duration is coping with change. There are many aspects to coping with change, including changes in requirements, changes in technology, vendor stability, changes in staffing and so on.

    The lifetime of Herschel will have been 18 years from the start of software development to the end of the post-operations phase.
    We designed a single system to meet the needs of all mission phases, from early instrument development, through routine in-flight operations to the end of the post-operations phase. Although the spacecraft was not launched until 2009, the database was in regular use from 2002 for developing and testing the instruments in the laboratory. By using the same software to control the instruments in the laboratory as we used to control them in flight, we ended up with a very robust and well-tested system. We call this approach “smooth transition”.

    The development approach we adopted is probably best classified as an Agile iterative and incremental one. Object orientation helps a lot because changes in the problem domain, resulting from changing requirements, tend to result in localised changes in the data model.
    Other important factors in managing change are separation of concerns and minimization of dependencies, for example using component-based architectures.

    When we decided to use an object database, it was a new technology and it would have been unwise to rely on any database vendor or product surviving for such a long time. Although work was under way on the ODMG and JDO standards, these were quite immature and the only suitable object databases used proprietary interfaces.
    We therefore chose to implement our own abstraction layer around the database. This was similar in concept to JDO, with a factory providing a pluggable implementation of a persistence manager. This abstraction provided a route to change to a different object database, or even a relational database with an object-relational mapping layer, should it have proved necessary.

    One aspect that is difficult to abstract is the use of queries, because query languages differ. In principle, an object database could be used without any queries, by navigating to everything from a global root object. However, in practice navigation and queries both have their role. For example, to find all the observation requests that have not yet been scheduled, it is much faster to perform a query than to iterate by navigation to find them. However, once an observation request is in memory it is much easier and faster to navigate to all the associated objects needed to process it. We have used a variety of techniques for encapsulating queries. One is to implement them as methods of an extent class that acts as a query factory.
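
The Herschel uplink software is written in Java against an object database; as a rough, language-agnostic illustration of the two patterns described above (a pluggable persistence manager behind an abstraction layer, and queries encapsulated as methods of an “extent” class acting as a query factory), here is a minimal Python sketch with invented names.

from abc import ABC, abstractmethod

class PersistenceManager(ABC):
    @abstractmethod
    def query(self, predicate):
        """Return all persistent objects matching the predicate."""

class InMemoryPersistenceManager(PersistenceManager):
    # One pluggable implementation; another could wrap an object database or
    # a relational database with O/R mapping, without changing callers.
    def __init__(self, objects):
        self._objects = list(objects)

    def query(self, predicate):
        return [o for o in self._objects if predicate(o)]

class ObservationRequestExtent:
    # Query factory: callers ask for domain-level result sets instead of
    # scattering query-language strings through the application.
    def __init__(self, pm):
        self._pm = pm

    def unscheduled(self):
        return self._pm.query(lambda req: req["scheduled"] is False)

pm = InMemoryPersistenceManager([
    {"id": 1, "scheduled": True},
    {"id": 2, "scheduled": False},
])
print(ObservationRequestExtent(pm).unscheduled())   # -> [{'id': 2, 'scheduled': False}]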

    Another challenge was designing a robust data model that would serve all phases of the mission from instrument development in the laboratory, through pre-flight tests and routine operations to the end of post-operations. We approached this by starting with a model of the problem domain and then analysing use-cases to see what data needed to be persistent and where we needed associations. It was important to avoid the temptation to store too much just because transitive persistence made it so easy.

    One criticism that is sometimes raised against object databases is that the associations tend to encode business logic in the object schema, whereas relational databases just store data in a neutral form that can outlive the software that created it; if you subsequently decide that you need a new use-case, such as report generation, the associations may not be there to support it. This is true to some extent, but consideration of use cases for the entire project lifetime helped a lot. It is of course possible to use queries to work-around missing associations.

    Examples are sometimes given of how easy an object database is to use by directly persisting your business objects. This may be fine for a simple application with an embedded database, but for a complex system you still need to cleanly decouple your business logic from the data storage. This is true whether you are using a relational or an object database. With an object database, the persistent classes should only be responsible for persistence and referential integrity and so typically just have getter and setter methods.
    We have encapsulated our persistent classes in a package called the Core Class Model (CCM) that has a factory to create instances. This complements the pluggable persistence manager. Hence, the application sees the persistence manager and CCM factories and interfaces, but the implementations are hidden.
    Applications define their own business classes which can work like decorators for the persistent classes.

    Q5. What is your experience in having two separate database systems for Herschel? A relational database for storing and managing processed data products and an object database for storing and managing proposal data, mission planning data, telecommands and raw (unprocessed) telemetry?

    Jon Brumfitt: There are essentially two parts to the ground segment for a space observatory.
    One is the “uplink” which is used for controlling the spacecraft and instruments. This includes submission of observing proposals, observation planning, scheduling, flight dynamics and commanding.
    The other is the “downlink”, which involves ingesting and processing the data received from the spacecraft.

    On some missions the data processing is carried out by a data centre, which is separate from spacecraft operations. In that case there is a very clear separation.
    On Herschel, the original concept was to build a completely integrated system around an object database that would hold all uplink and downlink data, including processed data products. However, after further analysis it became clear that it was better to integrate our product archive with those from other missions. This also means that the Herschel data will remain available long after the project has finished. The role of the object database is essentially for operating the spacecraft and storing the raw data.

    The Herschel archive is part of a common infrastructure shared by many of our ESA science projects. This provides a uniform way of accessing data from multiple missions.
    The following is a nice example of how data from Herschel and our XMM-Newton X-ray telescope have been combined to make a multi-spectral image of the Andromeda Galaxy.

    Our archive, in turn, forms part of a larger international archive known as the “Virtual Observatory” (VO), which includes both space and ground-based observatories from all over the world.

    I think that using separate databases for operations and product archiving has worked well. In fact, it is more the norm rather than the exception. The two databases serve very different roles.
    The uplink database manages the day-to-day operations of the spacecraft and is constantly being updated. The uplink data forms a complex object graph which is accessed by navigation, so an object database is well suited.
    The product archive is essentially a write-once-read-many repository. The data is not modified, but new versions of products may be added as a result of reprocessing. There are a large number of clients accessing it via the Internet. The archive database is a catalogue containing the product meta-data, which can be queried to find the relevant product files. This is better suited to a relational database.

    The motivation for the original idea of using a single object database for everything was that it allowed direct association between uplink and downlink data. For example, processed products could be associated with their observation requests. However, using separate databases does not prevent one database being queried with an observation identifier obtained from the other.
    One complication is that processing an observation requires both downlink data and the associated uplink data.
    We solved this by creating “uplink products” from the relevant uplink data and placing them in the archive. This has the advantage that external users, who do not have access to the Versant database, have everything they need to process the data themselves.

    Q6. What are the main lessons learned so far in using Versant object database for managing telemetry data and information on steering and calibrating scientific on-board instruments?

    Jon Brumfitt: Object databases can be very effective for certain kinds of application, but may have less benefit for others. A complex system typically has a mixture of application types, so the advantages are not always clear cut. Object databases can give high performance for applications that need to navigate through a complex object graph, particularly if used with fairly long transactions where a significant part of the object graph remains in memory. Web (JavaEE) applications lose some of the benefit because they typically perform many short transactions, each one performing a query. They also use additional access layers, so the resulting system loses the simplicity of the transparent persistence of an object database.

    In our case, the object database was best suited for the uplink. It simplified the uplink development by avoiding object-relational mapping and the complexity of a design based on JDBC or EJB 2. Nowadays with JPA, relational databases are much easier to use for object persistence, so the rationale for using an object database is largely determined by whether the application can benefit from fast navigational access and how much effort is saved in mapping. There are now at least two object database vendors that support both JDO and JPA, so the distinction is becoming somewhat blurred.
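
    To make the mapping trade-off concrete, here is a minimal, hypothetical JPA sketch; the ObservationRequest and Observation classes are invented for illustration and are not part of the Herschel system. The same annotated model can be handled by JPA providers for relational databases, and by object database vendors that expose a JPA or JDO interface:

        // Minimal JPA sketch (hypothetical classes, file ObservationRequest.java).
        import javax.persistence.*;
        import java.util.List;

        @Entity
        public class ObservationRequest {
            @Id
            private long id;

            private String proposalId;

            // Navigational access: walking from a request to its observations
            // is plain object-graph traversal once the entity is loaded.
            @OneToMany(mappedBy = "request", cascade = CascadeType.ALL)
            private List<Observation> observations;

            public List<Observation> getObservations() { return observations; }
        }

        @Entity
        class Observation {
            @Id
            private long id;

            @ManyToOne
            private ObservationRequest request;
        }

    With a model like this, request.getObservations() is simple navigation, which is where an object database tends to shine; a relational back end instead relies on the JPA provider to generate the corresponding joins.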

    For telemetry access we query the database instead of using navigation, as the packets don’t fit neatly into a single containment hierarchy. Queries allow packets to be accessed by many different criteria, such as time, instrument, type, source and so on.
    Processing calibration observations does not introduce any special considerations as far as the database is concerned.

    Q7. Did you have any scalability and or availability issues during the project? If yes, how did you solve them?

    Jon Brumfitt: Scalability would have been an important issue if we had kept to the original concept of storing everything including products in a single database. However, using the object database for just uplink and telemetry meant that this was not a big issue.

    The data processing grid retrieves the raw telemetry data from the object database server, which is a 16-core Linux machine with 64 GB of memory. The average load on the server is quite low, but occasionally there have been high peak loads from the grid that have saturated the server disk I/O and slowed down other users of the database. Interactive applications such as mission planning need a rapid response, whereas batch data processing is less critical. We solved this by implementing a mechanism to spread out the grid load by treating the database as a resource.

    Once a year, we have made an “Announcement of Opportunity” for astronomers to propose observations that they would like to perform with Herschel. It is only human nature that many people leave it until the last minute and we get a very high peak load on the server in the last hour or two before the deadline! We have used a separate server for this purpose, rather than ingesting proposals directly into our operational database. This has avoided any risk of interfering with routine operations. After the deadline, we have copied the objects into the operational database.

    Q8. What about the overall performance of the two databases? What are the lessons learned?

    Jon Brumfitt: The databases are good at different things.
    As mentioned before, an object database can give high performance for applications involving a complex object graph which you navigate around. An example is our mission planning system. Object persistence makes application design very simple, although in a real system you still need to introduce layers to decouple the business logic from the persistence.

    For the archive, on the other hand, a relational database is more appropriate. We are querying the archive to find data that matches a set of criteria. The data is stored in files rather than as objects in the database.

    Q9. What are the next steps planned for the project and the main technical challenges ahead?

    Jon Brumfitt: As I mentioned earlier, the coming post-operations phase will concentrate on further improving the data processing software to generate a top-quality legacy archive, and on provision of high-quality support documentation and continued interactive support for the community of astronomers that forms our “customer base”. The system was designed from the outset to support all phases of the mission, from early instrument development tests in the laboratory, through routine operations to the end of the post-operations phase of the mission. The main difference moving into post-operations is that we will stop uplink activities and ingesting new telemetry. We will continue to reprocess all the data regularly as improvements are made to the data processing software.

    We are currently in the process of upgrading from Versant 7 to Versant 8.
    We have been using Versant 7 since launch and the system has been running well, so there has been little urgency to upgrade.
    However, with routine operations coming to an end, we are doing some “technology refresh”, including upgrading to Java 7 and Versant 8.

    Q10. Anything else you wish to add?

    Jon Brumfitt: These are just some personal thoughts on the way the database market has evolved over the lifetime of Herschel. Thirteen years ago, when we started development of our system, there were expectations that object databases would really take off in line with the growing use of object orientation, but this did not happen. Object databases still represent rather a niche market. It is a pity there is no open-source object-database equivalent of MySQL. This would have encouraged more people to try object databases.

    JDO has developed into a mature standard over the years. One of its key features is that it is “architecture neutral”, but in fact there are very few implementations for relational databases. However, it seems to be finding a new role for some NoSQL databases, such as the Google AppEngine datastore.
    NoSQL appears to be taking off far quicker than object databases did, although it is an umbrella term that covers quite a few kinds of datastore. Horizontal scaling is likely to be an important feature for many systems in the future. The relational model is still dominant, but there is a growing appreciation of alternatives. There is even talk of “Polyglot Persistence” using different kinds of databases within a system; in a sense we are doing this with our object database and relational archive.

    More recently, JPA has created considerable interest in object persistence for relational databases and appears to be rapidly overtaking JDO.
    This is partly because it is being adopted by developers of enterprise applications who previously used EJB 2.
    If you look at the APIs of JDO and JPA they are actually quite similar apart from the locking modes. However, there is an enormous difference in the way they are typically used in practice. This is more to do with the fact that JPA is often used for enterprise applications. The distinction is getting blurred by some object database vendors who now support JPA with an object database. This could expand the market for object databases by attracting some traditional relational type applications.

    So, I wonder what the next 13 years will bring! I am certainly watching developments with interest.
    ——

    Dr Jon Brumfitt, System Architect & System Engineer of Herschel Scientific Ground Segment, European Space Agency.

    Jon Brumfitt has a background in Electronics with Physics and Mathematics and has worked on several of ESA’s astrophysics missions, including IUE, Hipparcos, ISO, XMM and currently Herschel. After completing his PhD and a post-doctoral fellowship in image processing, Jon worked on data reduction for the IUE satellite before joining Logica Space and Defence in 1980. In 1984 he moved to Logica’s research centre in Cambridge and then in 1993 to ESTEC in the Netherlands to work on the scientific ground segments for ISO and XMM. In January 2000, he joined the newly formed Herschel team as science ground segment System Architect. As Herschel approached launch, he moved down to the European Space Astronomy Centre in Madrid to become part of the Herschel Science Operations Team, where he is currently System Engineer and System Architect.

    Related Posts

    The Gaia mission, one year later. Interview with William O’Mullane. January 16, 2013

    Objects in Space: “Herschel” the largest telescope ever flown. March 18, 2011

    Resources

    Introduction to ODBMS By Rick Grehan

    ODBMS.org Resources on Object Database Vendors.

    —————————————
    You can follow ODBMS.org on Twitter : @odbmsorg

    ##

    On Oracle NoSQL Database –Interview with Dave Segleau. http://www.odbms.org/blog/2013/07/on-oracle-nosql-database-interview-with-dave-segleau/ http://www.odbms.org/blog/2013/07/on-oracle-nosql-database-interview-with-dave-segleau/#comments Tue, 02 Jul 2013 07:18:08 +0000 http://www.odbms.org/blog/?p=2454

    “We went down the path of building Oracle NoSQL database because of explicit request from some of our largest Oracle Berkeley DB installations that wanted to move away from maintaining home grown sharding implementations and very much wanted an out of box technology that can replicate the robustness of what they had built “out of box” ” –Dave Segleau.

    On October 3, 2011 Oracle announced the Oracle NoSQL Database, and on December 17, 2012, Oracle shipped Oracle NoSQL Database R2. I wanted to know more about the status of the Oracle NoSQL Database. I have interviewed Dave Segleau, Director of Product Management, Oracle NoSQL Database.

    RVZ

    Q1. Who is currently using Oracle NoSQL Database, and for what kind of domain specific applications? Please give us some examples.

    Dave Segleau: There is a range of users, from segments such as Web-scale Transaction Processing to Web-scale Personalization and Real-time Event Processing. To pick the area where I would say we see the largest adoption, it would be the Real-time Event Processing category. This is basically the use case that covers things like Fraud Detection, Telecom Services Billing, Online Gaming and Mobile Device Management.

    Q2. What is new in Oracle NoSQL Database R2?

    Dave Segleau: We added significant enhancements to NoSQL Database in the areas of Configuration Management/Monitoring (CM/M), APIs and Application Developer Usability, as well as Integration with the Oracle technology stack.
    In the area of CM/M, we added “Smart Topology” (an automated, capacity- and reliability-aware data storage allocation scheme with intelligent request routing), configuration elasticity and rebalancing, and JMX/SNMP support. In the area of APIs and Application Developer Usability we added a C API, support for values as JSON objects (with AVRO serialization), JSON schema definitions, and a Large Object API (including a highly efficient streaming interface). In the area of Integration we added support for accessing NoSQL Database data via Oracle External Tables (using SQL in the Oracle Database), RDF Graph support in NoSQL Database, and integration with Oracle Coherence as well as with Oracle Event Processing.

    Q3. How would you compare Oracle NoSQL with respect to other NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak?

    Dave Segleau: The Oracle NoSQL Database is a key-value store, although it also supports JSON as a value type similar to a document store. Architecturally it is closer to Riak, Cassandra and the Amazon Dynamo-based implementations, rather than the other technologies, at least at the highest level of abstraction. With regards to features, Oracle NoSQL Database shares a lot of commonality with Riak. Our performance and scalability characteristics are showing up with the best results in YCSB benchmarks.

    Q4. What is the implication of having Oracle Berkeley DB Java Edition as the core engine for the Oracle NoSQL database?

    Dave Segleau: It means that Oracle NoSQL Database provides a mission-critical proven database technology at the heart of the implementation. Many of the other NoSQL databases use relatively new implementations for data storage and replication. Databases in general, and especially distributed parallel databases, are hard to implement well and achieve high product quality and reliability. So we see the use of Oracle Berkeley DB, a pervasively deployed database engine for 1000’s of mission-critical applications, as a big differentiation. Plus, many of the early NoSQL technologies are based on Oracle Berkeley DB, for example LinkedIn’s Voldemort, Amazon’s Dynamo and other popular commercial and enterprise social media products like Yammer.
    The bottom line is that we went down the path of building Oracle NoSQL database because of explicit request from some of our largest Oracle Berkeley DB installations that wanted to move away from maintaining home grown sharding implementations and very much wanted an out of box technology that can replicate the robustness of what they had built “out of box”.

    Q5. What is the relationship between the underlying “cleaning” mechanism to free up unused space in Oracle Berkeley DB, and the predictability and throughput in Oracle NoSQL Database?

    Dave Segleau: As mentioned in the previous section, Oracle NoSQL Database uses Oracle Berkeley DB Java Edition as the key-value storage mechanism within the Storage Nodes. Oracle Berkeley DB Java Edition uses a no-overwrite log file system to store the data and a configurable multi-threaded background log cleaner task to compact and clean log files and free up unused disk space. The Oracle Berkeley DB log cleaner has undergone many years of in-house and real-world, high-volume validation and tuning. Oracle NoSQL Database pre-defines the BDB cleaner parameters for optimal configuration for this particular use case. The cleaner enhances system throughput and predictability by a) running as a low-level background task, and b) being preconfigured to minimize impact on the running system. The combination of these two characteristics leads to more predictable system throughput.

    Several other NoSQL database products have implemented heavyweight tasks to compact, compress and free up disk space. Running them definitely impacts system throughput and predictability. From our point of view, not only do you want a NoSQL database that has excellent performance, but you also need predictable performance. Routine tasks like Java GCs and disk space management should not cause major impacts to operational throughput in a production system.

    Q7. Oracle NoSQL data model is using the concepts of “major” and “minor” key path. Why?

    Dave Segleau: We heard from customers that they wanted both even distribution of data as well as co-location of certain sets of records. The Major/Minor key paradigm provides the best of both worlds. The Major key is the basis for the hash function, which causes Major key values to be evenly distributed throughout the key-value data store. The Minor key allows us to cluster multiple records for a given Major key together in the same storage location. In addition to being very flexible, it also provides additional benefits:
    a) A scalable two-tier indexing structure. A hash map of Major Keys to partitions that contain the data, and then a B-tree within each partition to quickly locate Minor key values.
    b) Minor keys allow us to perform efficient lookups and range scans within a Major key. For example, for userID 1234 (Major key), fetch all of the products that they browsed from January 1st to January 15th (Minor key).
    c) Because all of the Minor key records for a given Major key are co-located on the same storage location, this becomes our basic unit of ACID transactions, allowing applications to have a transaction that spans a single record, multiple records or even multiple write operations on multiple records for a given major key.

    This degree of flexibility is often lacking in other NoSQL database offerings.
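
    As a rough illustration of the major/minor key model described above, here is a minimal sketch using the Oracle NoSQL Database Java driver (oracle.kv). The store name, host and key components are hypothetical, and the exact method signatures should be checked against the documentation for your release:

        import oracle.kv.*;
        import java.util.Arrays;
        import java.util.SortedMap;

        public class MajorMinorSketch {
            public static void main(String[] args) {
                // Hypothetical store name and helper host.
                KVStore store = KVStoreFactory.getStore(
                        new KVStoreConfig("kvstore", "node01:5000"));

                // The major path ("user", "1234") drives the hash to one partition;
                // the minor path clusters this record with the user's other records.
                Key key = Key.createKey(
                        Arrays.asList("user", "1234"),
                        Arrays.asList("browsed", "2013-01-05"));
                store.put(key, Value.createValue("productId=987".getBytes()));

                // Range scan within the major key: all records co-located with user 1234.
                SortedMap<Key, ValueVersion> all =
                        store.multiGet(Key.createKey(Arrays.asList("user", "1234")), null, null);
                for (Key k : all.keySet()) {
                    System.out.println(k);
                }
                store.close();
            }
        }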

    Q8. Oracle NoSQL database is a distributed, replicated key-value store using a shared-nothing master-slave architecture. Why did you choose to have a master node architecture? How does it compare with other systems which have no master?

    Dave Segleau: First of all, let’s clarify that each shard has its own master (so the system as a whole is multi-master) and that masters are elected. The Oracle NoSQL Database topology is deployed with a user-specified replication factor (how many copies of the data the system should maintain) and then, using a PAXOS-based mechanism, a master is elected. It is quite possible that a new master is elected under certain operating conditions. Plus, if you throw more hardware resources at the system, those “masters” will shift the data for which they are responsible, again to achieve the optimal latency profile. We are leveraging the enterprise-grade replication technology that is widely deployed via the Oracle Berkeley DB Java Edition. Also, by using an elected-master implementation, we can provide a fully ACID transaction on an operation-by-operation basis.

    Q9. It is known that when the master node for a particular key-value fails (or because of a network failure), some writes may get lost. What is the implication from an application point of view?

    Dave Segleau: This is a complex question in that it depends largely on the type of durability requested for the operation, and that is controlled by the developer. In general though, committed transactions acknowledged by a simple majority of nodes (our default durability) are not lost when a master fails. In the case of less aggressive durability policies, in-flight transactions that have been subject to network, disk or server failures are handled similarly to process failures in other database implementations: the transactions are rolled back. However, a new master will quickly be elected and future requests will go through without a hitch. Applications can guard against such situations by handling exceptions and performing a retry.

    Q10. Justin Sheehy of Basho in an interview said (1): “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense” Would you recommend to your clients to use Oracle NoSQL database for banking applications?

    Dave Segleau: Absolutely. The Oracle NoSQL Database offers a range of transaction durability and consistency options on a per operation basis. The choice of eventual consistency is best made on a case by case basis, because while using it can open up new levels of scalability and performance, it does come with some risk and/or alternate processes which have a cost. Some NoSQL vendors don’t provide the options to leverage ACID transactions where they make sense, but the Oracle NoSQL Database does.

    Q11. Could you detail how Elasticity is provided in R2?

    Dave Segleau: The Oracle NoSQL database slices data up into partitions within highly available replication groups. Each replication group contains an elected master and a number of replicas, based on user configuration. The exact configuration will vary depending on the read latency/write throughput requirements of the application. The processes associated with those replication groups run on hardware (Storage Nodes) declared to the Oracle NoSQL Database. For elasticity purposes, additional Storage Nodes can be declaratively added to a running system, in which case some of the data partitions will be re-allocated onto the new hardware, thereby increasing the number of shards and the write throughput. Additionally, the number of replicas can be increased to improve read latency and increase reliability. The process of rebalancing data partitions, spawning new replicas, and forming new Replication Groups will cause those internal data partitions to automatically move around the Storage Nodes to take advantage of the new storage capacity.

    Q12. What is the implication from a developer perspective of having Avro Schema Support?

    Dave Segleau: For the developer, it means better support for seamless JSON storage. There are other downstream implications, like compatibility and integration with Hadoop processing where AVRO is quickly becoming a standard not only for efficient wireline serialization protocols, but for HDFS storage. Also, AVRO is a very efficient serialization format for JSON, unlike other serialization options like BSON which tend to be much less efficient. In the future, Oracle NoSQL Database will leverage this flexible AVRO schema definition in order to provide features like table abstractions, projections and secondary index support.

    Q13. How do you handle Large Object Support?

    Dave Segleau: Oracle NoSQL Database provides a streaming interface for Large Objects. Internally, we break a Large Object up into chunks and use parallel operations to read and write those chunks to/from the database. We do it in an ordered fashion so that you can begin consuming the data stream before all of the contents are returned to the application. This is useful when implementing functionality like scrolling partial results, streaming video, etc. Large Object operations are restartable and recoverable. Let’s say that you start to write a 1 GB Large Object and sometime during the write operation a failure occurs and the write is only partially completed. The application will get an exception. When the application re-issues the Large Object operation, NoSQL resumes where it left off, skipping chunks that were already successfully written.
    The Large Object chunking implementation also ensures that partially written Large Objects are not readable until they are completely written.

    Q14. A NoSQL Database can act as an Oracle Database External Table. What does it mean in practice?

    Dave Segleau: What we have achieved here is the ability to treat the Oracle NoSQL Database as a resource that can participate in SQL queries originating from an Oracle Database via standard SQL query facilities. Of course, the developer has to define a template that maps the “value” into a table representation. In release 2 we provide sample templates and configuration files that the application developer can use in order to define the required components. In the future, Oracle NoSQL Database will automate template definitions for JSON values. External Table capabilities give seamless access to both structured relational and unstructured NoSQL data using familiar SQL tools.

    Q15. Why Atomic Batching is important?

    Dave Segleau: If by Atomic Batching you mean the ability to perform more than one data manipulation in a single transaction, then atomic batching is the only real way to ensure logical consistency in multi-data update transactions. The Oracle NoSQL Database provides this capability for data beneath a given major key.
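
    A minimal sketch of such a batch, assuming the oracle.kv Java client; the store name, host and keys are hypothetical. Both operations share the same major path, which is what allows the batch to execute atomically on a single shard:

        import oracle.kv.*;
        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;

        public class AtomicBatchSketch {
            public static void main(String[] args) throws Exception {
                KVStore store = KVStoreFactory.getStore(
                        new KVStoreConfig("kvstore", "node01:5000"));
                OperationFactory of = store.getOperationFactory();

                List<Operation> ops = new ArrayList<Operation>();
                // Both keys share the major path ("account", "1234").
                ops.add(of.createPut(
                        Key.createKey(Arrays.asList("account", "1234"), Arrays.asList("balance")),
                        Value.createValue("100".getBytes())));
                ops.add(of.createPut(
                        Key.createKey(Arrays.asList("account", "1234"), Arrays.asList("lastTxn")),
                        Value.createValue("2013-07-02".getBytes())));

                store.execute(ops);   // all-or-nothing
                store.close();
            }
        }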

    Q16 What are the suggested criteria for users when they need to choose between durability for lower latency, higher throughput and write availability?

    Dave Segleau: That’s a tough one to answer, since it is so case by case dependent. As discussed above in the banking question, in general if you can achieve your application latency goals while specifying high durability, then that’s your best course of action. However, if you have more aggressive low-latency/high-throughput requirements, you may have to assess the impact of relaxing your durability constraints and of the rare case where a write operation may fail. It’s useful to keep in mind that a write failure is a rare event because of the inherent reliability built into the technology.

    Q17. Tomas Ulin mentioned in an interview (2) that “with MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements”. Isn’t MySQL 5.6 in fact competing with Oracle NoSQL database?

    Dave Segleau: MySQL is an SQL database with a KV API on top. We are a KV database. If you have an SQL application with occasional need for fast KV access, MySQL is your best option. If you need pure KV access with unlimited scalability, then NoSQL DB is your best option.

    ———
    David Segleau is the Director of Product Management for the Oracle NoSQL Database, Oracle Berkeley DB and Oracle Database Mobile Server. He joined Oracle as the VP of Engineering for Sleepycat Software (makers of Berkeley DB). He has more than 30 years of industry experience, leading and managing technical product teams and working extensively with database technology as both a customer and a vendor.

    Related Posts

    (1) On Eventual Consistency– An interview with Justin Sheehy. August 15, 2012.

    (2) MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013.

    Resources

    Charles Lamb’s Blog

    ODBMS.org: Resources on NoSQL Data Stores:
    Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis.

    Follow ODBMS.org on Twitter: @odbmsorg
    ##

    On PostgreSQL. Interview with Tom Kincaid. http://www.odbms.org/blog/2013/05/on-postgresql-interview-with-tom-kincaid/ http://www.odbms.org/blog/2013/05/on-postgresql-interview-with-tom-kincaid/#comments Thu, 30 May 2013 10:05:20 +0000 http://www.odbms.org/blog/?p=2351

    “Application designers need to start by thinking about what level of data integrity they need, rather than what they want, and then design their technology stack around that reality. Everyone would like a database that guarantees perfect availability, perfect consistency, instantaneous response times, and infinite throughput, but it’s not possible to create a product with all of those properties.” –Tom Kincaid.

    What is new with PostgreSQL? I have interviewed Tom Kincaid, head of Products and Engineering at EnterpriseDB.

    RVZ

    (Tom prepared the following responses with contributions from the EnterpriseDB development team)

    Q1. EnterpriseDB products are based upon PostgreSQL. What is special about your product offering?

    Tom Kincaid: EnterpriseDB has integrated many enterprise features and performance enhancements into the core PostgreSQL code to create a database with the lowest possible TCO and provide the “last mile” of service needed by enterprise database users.

    EnterpriseDB’s Postgres Plus software provides the performance, security and Oracle compatibility needed to address a range of enterprise business applications. EnterpriseDB’s Oracle compatibility, also integrated into the PostgreSQL code base, allows many Oracle shops to realize a much lower database TCO while utilizing their Oracle skills and applications designed to work against Oracle databases.

    EnterpriseDB also creates enterprise-grade tools around PostgreSQL and Postgres Plus Advanced Server for use in large-scale deployments. They are Postgres Enterprise Manager, a powerful management console for managing, monitoring and tuning databases en masse whether they’re the PostgreSQL community version or EnterpriseDB’s enhanced Postgres Plus Advanced Server; xDB Replication Server with multi-master replication and replication between Postgres, Oracle and SQL Server databases; and SQL/Protect for guarding against SQL Injection attacks.

    Q2. How does PostgreSQL compare with MariaDB and MySQL 5.6?

    Tom Kincaid: There are several areas of difference. PostgreSQL has traditionally had a stronger focus on data integrity and compliance with the SQL standard.
    MySQL has traditionally been focused on raw performance for simple queries, and a typical benchmark is the number of read queries per second that the database engine can carry out, while PostgreSQL tends to focus more on having a sophisticated query optimizer that can efficiently handle more complex queries, sometimes at the expense of speed on simpler queries. And, for a long time, MySQL had a big lead over PostgreSQL in the area of replication technologies, which discouraged many users from choosing PostgreSQL.

    Over time, these differences have diminished. PostgreSQL’s replication options have expanded dramatically in the last three releases, and its performance on simple queries has greatly improved in the most recent release (9.2). On the other hand, MySQL and MariaDB have both done significant recent work on their query optimizers. So each product is learning from the strengths of the other.

    Of course, there’s one other big difference, which is that PostgreSQL is an independent open source project that is not, and cannot be, controlled by any single company, while MySQL is now owned and controlled by Oracle.
    MariaDB is primarily developed by the Monty Program and shows signs of growing community support, but it does not yet have the kind of independent community that PostgreSQL has long enjoyed.

    Q3. Tomas Ulin mentioned in an interview that “with MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements”. What is your take on this?

    Tom Kincaid: I think anyone who is developing an RDBMS today has to be aware that there are some users who are looking for the features of a key-value store or document database.
    On the other hand, many NoSQL vendors are looking to add the sorts of features that have traditionally been associated with an enterprise-grade RDBMS. So I think that theme of convergence is going to come up over and over again in different contexts.
    That’s why, for example, PostgreSQL added a native JSON datatype as part of the 9.2 release, which is being further enhanced for the forthcoming 9.3 release.
    Will we see a RESTful or memcached-like interface to PostgreSQL in the future? Perhaps.
    Right now our customers are much more focused on improving and expanding the traditional RDBMS functionality, so that’s where our focus is as well.
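
    For illustration, here is a minimal JDBC sketch of the native JSON datatype mentioned above, against a hypothetical PostgreSQL 9.2 database; the connection details and table are invented. The json column validates documents on insert, while the richer JSON operators and functions arrive in later releases:

        import java.sql.*;

        public class JsonSketch {
            public static void main(String[] args) throws SQLException {
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:postgresql://localhost:5432/demo", "demo", "secret");
                     Statement st = conn.createStatement()) {

                    st.execute("CREATE TABLE IF NOT EXISTS events (id serial PRIMARY KEY, payload json)");

                    try (PreparedStatement ps = conn.prepareStatement(
                            "INSERT INTO events (payload) VALUES (?::json)")) {
                        ps.setString(1, "{\"user\": 42, \"action\": \"login\"}");
                        ps.executeUpdate();
                    }

                    try (ResultSet rs = st.executeQuery("SELECT payload FROM events")) {
                        while (rs.next()) {
                            System.out.println(rs.getString("payload"));
                        }
                    }
                }
            }
        }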

    Q4. How would you compare your product offering with respect to NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak, and NewSQL such as NuoDB and VoltDB?

    Tom Kincaid: It is a matter of the right tools for the right problem. Many of our customers use our products together with the NoSQL solutions you mention. If you need ACID transaction properties for your data, with savepoints and rollback capabilities, along with the ability to access data in a standardized way and a large third party tool set for doing it, a time tested relational database is the answer.
    The SQL standard provides the benefit of always being able to switch products and having a host of tools for reporting and administration. PostgreSQL, like Linux, provides the benefit of being able to switch service partners.

    If your use case does not mandate the benefits mentioned above and you have data sets in the Petabyte range and require the ability to ingest Terabytes of data every 3-4 hours, a NoSQL solution is likely the right answer. As I said earlier many of our customers use our database products together with NoSQL solutions quite successfully. We expect to be working with many of the NoSQL vendors in the coming year to offer a more integrated solution to our joint customers.

    Since it is still pretty new, I haven’t had a chance to evaluate NuoDB so I can’t comment on how it compares with PostgreSQL or Postgres Plus Advanced Server.

    As far as VoltDB is concerned there is a blog by Dave Page, our Chief Architect for tools and installers, that describes the differences between PostgreSQL and VoltDB. It can be found here.

    There is also some terrific insight, on this topic, in an article by my colleague Bruce Momjian, who is one of the most active contributors to PostgreSQL, that can be found here.

    Q5. Justin Sheehy of Basho in an interview said “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”. What is your opinion on this?

    Tom Kincaid: It’s overly simplistic. There is certainly room for asynchronous multi-master replication in applications such as banking, but it has to be done very, very carefully to avoid losing track of the money.
    It’s not clear that the NoSQL products which provide eventual consistency today make the right trade-offs or provide enough control for serious enterprise applications – or that the products overall are sufficiently stable. Relational databases remain the most mature, time-tested, and stable solution for storing enterprise data.
    NoSQL may be appealing for Internet-focused applications that must accommodate truly staggering volumes of requests, but we anticipate that the RDBMS will remain the technology of choice for most of the mission-critical applications it has served so well over the last 40 years.

    Q6. What are the suggested criteria for users when they need to choose between durability for lower latency, higher throughput and write availability?

    Tom Kincaid: Application designers need to start by thinking about what level of data integrity they need, rather than what they want, and then design their technology stack around that reality.
    Everyone would like a database that guarantees perfect availability, perfect consistency, instantaneous response times, and infinite throughput, but it’s not possible to create a product with all of those properties.

    If you have an application that has a large write throughput and you assume that you can store all of that data using a single database server, which has to scale vertically to meet the load, you’re going to be unhappy eventually. With a traditional RDBMS, you’re going to be unhappy when you can’t scale far enough vertically. With a distributed key-value store, you can avoid that problem, but then you have all the challenges of maintaining a distributed system, which can sometimes involve correlated failures, and it may also turn out that your application makes assumptions about data consistency that are difficult to guarantee in a distributed environment.

    By making your assumptions explicit at the beginning of the project, you can consider alternative designs that might meet your needs better, such as incorporating mechanisms for dealing with data consistency issues, or even building application-level sharding into the application itself.

    Q7. How do you handle Large Objects Support?

    Tom Kincaid: PostgreSQL supports storing objects up to 1GB in size in an ordinary database column.
    For larger objects, there’s a separate large object API. In current releases, those objects are limited to just 2GB, but the next release of PostgreSQL (9.3) will increase that limit to 4TB. We don’t necessarily recommend storing objects that large in the database, though; in many cases, it’s more efficient to store enormous objects on a file server rather than as database objects. But the capabilities are there for those who need them.
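
    As a small, hypothetical illustration of the first option (an ordinary column rather than the large object API), the following JDBC sketch streams a file into a bytea column; connection details, table and file name are invented:

        import java.io.File;
        import java.io.FileInputStream;
        import java.sql.*;

        public class ByteaSketch {
            public static void main(String[] args) throws Exception {
                File f = new File("report.pdf");
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:postgresql://localhost:5432/demo", "demo", "secret")) {
                    conn.createStatement().execute(
                            "CREATE TABLE IF NOT EXISTS attachments (id serial PRIMARY KEY, body bytea)");
                    try (PreparedStatement ps = conn.prepareStatement(
                             "INSERT INTO attachments (body) VALUES (?)");
                         FileInputStream in = new FileInputStream(f)) {
                        ps.setBinaryStream(1, in, (int) f.length());  // streamed to the server by the driver
                        ps.executeUpdate();
                    }
                }
            }
        }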

    Q8. Do you use Data Analytics at EnterpriseDB and for what?

    Tom Kincaid: Most companies today use some form of data analytics to understand their customers and their marketplace, and we’re no exception. However, how we use data is rapidly changing given our rapid growth and deepening penetration into key markets.

    Q9. Do you have customers who have Big Data problem? Could you please give us some examples of Big Data Use Cases?

    Tom Kincaid: We have found that most customers with big data problems are using specialized appliances and in fact we partnered with Netezza to assist in creating such an appliance – The Netezza TwinFin Data Warehousing appliance.
    See here.

    Q10. How do you handle the Big Data Analytics “process” challenges with deriving insight?

    Tom Kincaid: EnterpriseDB does not specialize in solutions for the Big Data market and will refer prospects to specialists like Netezza.

    Q11. Do you handle un-structured data? If yes, how?

    Tom Kincaid: PostgreSQL has an integrated full-text search capability that can be used for document processing, and there are also XML and JSON data types that can be used for data of those types. We also have a PostgreSQL-specific data type called hstore that can be used to store groups of key-value pairs.

    Q12. Do you use Hadoop? If yes, what is your experience with Hadoop so far?

    Tom Kincaid: We developed, and released in late 2011, our Postgres Plus Connector for Hadoop, which allows massive amounts of data from a Postgres Plus Advanced Server (PPAS) or PostgreSQL database to be accessed, processed and analyzed in a Hadoop cluster. The Postgres Plus Connector for Hadoop allows programmers to process large amounts of SQL-based data using their familiar MapReduce constructs. Hadoop combined with PPAS or PostgreSQL enables users to perform real time queries with Postgres and non-real time CPU intensive analysis and with our connector, users can load SQL data to Hadoop, process it and even push the results back to Postgres.

    Q13 Cloud computing and open source: How does it relate to PostgreSQL?

    Tom Kincaid: In 2012, EnterpriseDB released its Postgres Plus Cloud Database. We’re seeing a wide-scale migration to cloud computing across the enterprise. With that growth has come greater clarity in what developers need in a cloudified database. The solutions are expected to deliver lower costs and management ease with even greater functionality because they are taking advantage of the cloud.

    ______________________
    Tom Kincaid. As head of Products and Engineering, Tom leads the company’s product development and directs the company’s world-class software engineers. Tom has nearly 25 years of experience in the Enterprise Software Industry.
    Prior to EnterpriseDB, he was VP of software development for Oracle’s GlassFish and Web Tier products.
    He integrated Sun’s Application Server Product line into Oracle’s Fusion middleware offerings. At Sun Microsystems, he was part of the original Java EE architecture and management teams and played a critical role in defining and delivering the Java Platform.
    Tom is a veteran of the Object Database industry and helped build Object Design’s customer service department holding management and senior technical contributor roles. Other positions in Tom’s past include Director of Quality Engineering at Red Hat and Director of Software Engineering at Unica.

    Related Posts

    MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013

    On Eventual Consistency– Interview with Monty Widenius. October 23, 2012

    Resources

    ODBMS.org: Relational Databases, NewSQL, XML Databases, RDF Data Stores
    Blog Posts |Free Software | Articles and Presentations| Lecture Notes | Tutorials| Journals |

    Follow ODBMS.org on Twitter: @odbmsorg

    ##

    Big Data for Genomic Sequencing. Interview with Thibault de Malliard. http://www.odbms.org/blog/2013/03/big-data-for-genomic-sequencing-interview-with-thibault-de-malliard/ http://www.odbms.org/blog/2013/03/big-data-for-genomic-sequencing-interview-with-thibault-de-malliard/#comments Mon, 25 Mar 2013 07:45:14 +0000 http://www.odbms.org/blog/?p=2144 “Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations” –Thibault de Malliard.

    Big Data for Genomic Sequencing. On this subject, I have interviewed Thibault de Malliard, researcher at the University of Montreal’s Philip Awadalla Laboratory, who is working on bioinformatics solutions for next-generation genomic sequencing.

    RVZ

    Q1. What are the main research activities of the University of Montreal’s Philip Awadalla Laboratory?

    Thibault de Malliard: The Philip Awadalla Laboratory is the Medical and Population Genomics Laboratory at the University of Montreal. Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations. Its research includes work relevant to all types of human diseases: genetic, immunological, infectious, chronic and cancer.
    Using genomic data from single-nucleotide polymorphisms (SNP), next-generation re-sequencing, and gene expression, along with modern statistical tools, the lab is able to locate genome regions that are associated with disease pathology and virulence as well as study the mechanisms that cause the mutations.

    Q2. What is the lab’s medical and population genomics research database?

    Thibault de Malliard: The lab’s database brings together all the mutations (SNPs) found by DNA genotyping, DNA sequencing and RNA sequencing for each sample. There is also annotation data from public databases.

    Q3. Why is data management important for the genomic research lab?

    Thibault de Malliard: All the data we have is in text CSV files. This is what our software takes as input, and it will output other text CSV files. So we use a lot of Bash and Perl to extract the information we need and to do some stats. As time goes on, we multiply the number of files by sample and by experiment, and we end up with statistics based on the whole data set that need recalculating each time we perform a new sequencing/genotyping run (mutation frequency, mutations per gene, etc.).

    With this database, we are also preparing for the lab’s future:
    • As the amount of data increases, one day an associative array will no longer fit in memory.
    • Looking through a 200 GB file to find one specific mutation will not be a good option.
    • Adding new data to the current files will take more and more time/space.
    • We need to be able to select the data according to every parameter we have, i.e., grouping by type of mutation and/or by chromosome, and/or by sample information by gender, ethnicity, age, or pathology.
    • We then need to export a file, or count / sum / average it.

    Q4. Could you give us a description of what kind of data is in the lab’s genomic research database storing and processing? And for what applications?

    Thibault de Malliard: We are storing single nucleotide polymorphisms (SNPs), which are the most common form of genetic mutations among people, from sequencing and genotyping. When an SNP is found for a sample, we also look at what we have at the same position for the other samples:
    • There is no SNP but data for the sample, so we know this sample does not have the SNP.
    OR
    • There is no data for the sample, so we cannot assess whether or not there is an SNP for this sample at this position.

    We gather between 1.8 and 2.5 million nucleotides (at least one sample has it) per sample, depending on the experiment technique. We store them in the database along with some information:
    • how damaged the SNP can be for the function of the gene
    • its frequency in different populations (African, European, French Canadian…).

    The database also contains information about each sample, such as gender, ethnicity and pathology. This will keep growing with our needs. So, basically, we have a sample table, a mutations table with their information, an experiment table and a big table linking the three previous tables with one-to-many relations.

    Here is a very slightly simplified example of a single record in our database:

    Single record in our database (grouped by source table):

    Mutation information table:
        SNP: T
        Chromosome: 1
        Position: 100099771
        Gene: NZT
        Damaging for gene function?: synonymous
        Present in known database?: yes

    Linking table (ties the other tables together; one mutation, for one sample, from one sequencing run):
        Sequencing quality: 26
        Sequencing coverage: 15
        Validated by another experiment?: no

    Sample table:
        Sample: 345
        Research project: Project_1
        Gender: Male
        Ethnicity: French
        Family: 10

    Sequencing table:
        Sequencing information: Illumina Hiseq 2500
        Sequencing type (DNA, RNA…): RNAseq
        Analysis pipeline info: No PCR duplicates, only properly paired

    The applications are multiple, but here are some which come to my mind:
    • extract subset of data to use with our tools
    • doing stats, counts
    • find specific data
    • annotate our data with public databases

    Q5. Why did you decide to deploy TokuDB database storage engine to optimize the lab’s medical and population genomics research database?

    Thibault de Malliard: We knew that the data could not be managed with MySQL and MyISAM. One big issue is the insert rate, and TokuDB offered a solution up to 50 times faster. Furthermore, TokuDB allows us to manipulate the structure of the database without blocking access to it. As a research team, we always have new information to add, which means column additions.

    Q6. Did you look/consider other vendor alternatives? If yes, which ones?

    Thibault de Malliard: None. This is much too time consuming.

    Q7. What are you specifically using TokuDB for?

    Thibault de Malliard: We only store genetic data, together with the information related to it.

    Q8. How many databases do you use? What are the data requirements?

    Thibault de Malliard: I had planned to use three databases:
    1. Database for RNA/DNA sequencing and from DNA genotyping (described before);
    2. Database for data from well-known reference databases (dbsnp, 1000genome);
    3. A last one to store analyzed data from database 1 and 2.

    The data stored is mainly the nucleotide (a character: A, C, G, T) with integer information like quality and position, and Boolean flags. I avoid using any strings to keep the table as small as possible.

    Q9. Especially, what are the requirements for data ingest of records and retrieve of data?

    Thibault de Malliard: As a research team, we do not have high requirements like real-time insertion from logs. But I would say, at most, the import should take less than a night. Updating database 1 is critical when a new sequencing or genotyping experiment is added: a batch of 50M records (it can be more than 3 times that!) has to be inserted. This has been happening monthly, but it should increase this year.

    We have a huge amount of data, and we need to get query results as fast as possible. We have been used to one or two days (a weekend) of query time – having 10 seconds is much more preferable!

    Q10. Could you give some examples of typical research requests that need data ingestion and retrieval?

    Thibault de Malliard: We have a table with all the SNPs for 1000 samples. This is currently a 100GB table.
    A typical query could be to get the sample that has a mutation different from the 999 others. We also have some samples that are families: a child with its parents. We want to find the SNPs present in the child, but not present in the other family members.
    We may want to find mutations common to one group of samples, given gender, disease state or ethnicity.
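
    To make the family example concrete, here is a hypothetical sketch of that query shape in SQL via JDBC; the table and column names (snp_call, sample, etc.) are invented for illustration and do not reflect the lab’s actual schema:

        import java.sql.*;

        // Sketch of the "present in the child but not the parents" query, against a
        // hypothetical simplified schema:
        //   snp_call(sample_id, chrom, pos, alt)  -- one row per observed SNP
        //   sample(id, family_id, relation)       -- relation: 'child' | 'parent'
        public class TrioQuerySketch {
            public static void main(String[] args) throws SQLException {
                String sql =
                    "SELECT c.chrom, c.pos, c.alt " +
                    "FROM snp_call c " +
                    "JOIN sample s ON s.id = c.sample_id " +
                    "WHERE s.family_id = ? AND s.relation = 'child' " +
                    "  AND NOT EXISTS (" +
                    "    SELECT 1 FROM snp_call p " +
                    "    JOIN sample ps ON ps.id = p.sample_id " +
                    "    WHERE ps.family_id = s.family_id AND ps.relation = 'parent' " +
                    "      AND p.chrom = c.chrom AND p.pos = c.pos AND p.alt = c.alt)";

                try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/genomics", "user", "secret");
                     PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setInt(1, 10);   // family 10, as in the example record above
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.printf("%s:%d %s%n",
                                    rs.getString(1), rs.getLong(2), rs.getString(3));
                        }
                    }
                }
            }
        }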

    Q11. What kind of scalability problems did you have?

    Thibault de Malliard: The problem is managing this huge amount of data. The number of connections should be very low. Most of the time, there is only one user. So I had to choose the data types carefully and the relationships between the tables. Lately, I ran into a very slow join with a range, so I decided to split the position-based tables by chromosome. Now there are 26 tables and some procedures to launch queries across the chromosomes. The time gained is hard to quantify.

    Q12. Do you have any benchmarking measures to sustain the claim that Tokutek’s TokuDB has improved scalability of your system?

    Thibault de Malliard: I populated the database with two billion records in the main table and then ran queries. While I did not see improvements with my particular query workload, I did see significant insertion performance gains. When I tried to add an extra 1M records (LOAD DATA INFILE), it took 51 minutes for MyISAM to load the data, but less than one minute with TokuDB. Extrapolating this to an RNA sequencing experiment: it would take about 2.5 days with MyISAM but one hour with TokuDB.

    Q13. What are the lessons learned so far in using TokuDB database storage engine in your application domain?

    Thibault de Malliard: We are still developing it and adding data. But inserting data into the two main tables (0.9G records, 2.3G records) was done in fairly good time, less than one day. Adding columns to fulfill the needs of the team is also very easy: it takes one second to create the column. Updating it is another story, but the table is still accessible during this process.
    Another great feature, one I use with every query, is being able to follow the state of the query.
    You can see in the process list the number of rows processed so far. So if you have a good estimate of the number of records expected, you know exactly how long the query will take. I cannot count the number of processes I have killed because the expected query time was not acceptable.

    Qx. Anything you wish to add?

    Thibault de Malliard: The sequencing/genotyping technologies evolve very fast. Evolving means more data from the machines. I expect our data to grow at least three times each year. We are glad to have TokuDB in place to handle the challenge.

    ————-
    Since 2010, Thibault de Malliard has worked in the University of Montreal’s Philip Awadalla Laboratory, where he provides bioinformatics support to the lab crew and develops bioinformatics solutions for next-generation genomic sequencing. Previously, he worked for the French National Institute for Agricultural Research (INRA) in the MIG laboratory (Mathematics, Informatics and Genomics) where, as part of the European Nanomubiop project, he was tasked with developing software to produce probes for an HPV chip. He holds a master’s degree in bioinformatics (France).

    Related Posts

    Big Data: Improving Hadoop for Petascale Processing at Quantcast. March 13, 2013

    On Big Data Analytics –Interview with David Smith. February 27, 2013

    Big Data Analytics at Netflix. Interview with Christos Kalantzis and Jason Brown. February 18, 2013

    MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013

    Scaling MySQL and MariaDB to TBs: Interview with Martín Farach-Colton. October 8, 2012

    Related Resources

    Big Data for Good. by Roger Barca, Laura Haas, Alon Halevy, Paul Miller, Roberto V. Zicari. June 5, 2012:
    A distinguished panel of experts discuss how Big Data can be used to create Social Capital.
    Blog Panel | Intermediate | English | DOWNLOAD (PDF)| June 2012|

    ODBMS.org resources on Relational Databases, NewSQL, XML Databases, RDF Data Stores.

    Follow ODBMS.org on Twitter: @odbmsorg
    ##

    MySQL-State of the Union. Interview with Tomas Ulin. http://www.odbms.org/blog/2013/02/mysql-state-of-the-union-interview-with-tomas-ulin/ http://www.odbms.org/blog/2013/02/mysql-state-of-the-union-interview-with-tomas-ulin/#comments Mon, 11 Feb 2013 07:58:15 +0000 http://www.odbms.org/blog/?p=2007 “With MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements” –Tomas Ulin.

    On February 5, 2013, Oracle announced the general availability of MySQL 5.6.
    I have interviewed Tomas Ulin, Vice President for the MySQL Engineering team at Oracle. I asked him several questions on the state of the union for MySQL.

    RVZ

    Q1. You support several different versions of the MySQL database. Why? How do they differ with each other?

    Tomas Ulin: Oracle provides technical support for several versions of the MySQL database to allow our users to maximize their investments in MySQL. Additional details about Oracle’s Lifetime Support policy can be found here.

    Each new version of MySQL has added new functionality and improved the user experience. Oracle just made available MySQL 5.6, delivering enhanced linear scalability, simplified query development, better transactional throughput and application availability, flexible NoSQL access, improved replication and enhanced instrumentation.

    Q2. Could you please explain in some more details how MySQL can offer a NoSQL access?

    Tomas Ulin: MySQL 5.6 provides simple, key-value interaction with InnoDB data via the familiar Memcached API. Implemented via a new Memcached daemon plug-in to mysqld, the new Memcached protocol is mapped directly to the native InnoDB API and enables developers to use existing Memcached clients to bypass the expense of query parsing and go directly to InnoDB data for lookups and transactional compliant updates. With MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements. More information is available here.
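
    As a small illustration of the key-value path, the sketch below uses the open-source spymemcached Java client against the InnoDB memcached plug-in; it assumes the plug-in is enabled and that a container mapping (innodb_memcache.containers) points keys at an InnoDB table. The host, port and key names are hypothetical (the plug-in listens on port 11211 by default):

        import java.net.InetSocketAddress;
        import net.spy.memcached.MemcachedClient;

        public class InnoDbMemcachedSketch {
            public static void main(String[] args) throws Exception {
                MemcachedClient client =
                        new MemcachedClient(new InetSocketAddress("db-host", 11211));

                // Writes and reads bypass SQL parsing and go straight to InnoDB rows,
                // while the same rows remain queryable with ordinary SQL.
                client.set("user:42", 0, "{\"name\": \"Ana\", \"level\": 7}").get();
                System.out.println(client.get("user:42"));

                client.shutdown();
            }
        }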

    MySQL Cluster presents multiple interfaces to the database, also providing the option to bypass the SQL layer entirely for native, blazing fast access to the tables. Each of the SQL and NoSQL APIs can be used simultaneously, across the same data set. NoSQL APIs for MySQL Cluster include memcached as well as the native C++ NDB API, Java (ClusterJ and ClusterJPA) and HTTP/REST. Additionally, during our MySQL Connect Conference last fall, we announced a new Node.js NoSQL API to MySQL Cluster as an early access feature. More information is available here.
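
    For the Java side, here is a minimal, hypothetical ClusterJ sketch (the table name, connect string and database are invented); ClusterJ maps an annotated interface onto an existing MySQL Cluster table and talks to the data nodes directly, without SQL:

        import java.util.Properties;
        import com.mysql.clusterj.ClusterJHelper;
        import com.mysql.clusterj.Session;
        import com.mysql.clusterj.SessionFactory;
        import com.mysql.clusterj.annotation.PersistenceCapable;
        import com.mysql.clusterj.annotation.PrimaryKey;

        // Hypothetical domain interface mapped to an existing "player" table.
        @PersistenceCapable(table = "player")
        interface Player {
            @PrimaryKey
            int getId();
            void setId(int id);

            String getName();
            void setName(String name);
        }

        public class ClusterJSketch {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("com.mysql.clusterj.connectstring", "mgm-host:1186");
                props.put("com.mysql.clusterj.database", "game");

                SessionFactory factory = ClusterJHelper.getSessionFactory(props);
                Session session = factory.getSession();

                Player p = session.newInstance(Player.class);
                p.setId(1);
                p.setName("Ana");
                session.persist(p);                      // written directly to the data nodes

                Player found = session.find(Player.class, 1);
                System.out.println(found.getName());
                session.close();
            }
        }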

    Q3. No single data store is best for all uses. What are the applications that are best suited for MySQL, and which ones are not?

    Tomas Ulin: MySQL is the leading open source database for web, mobile and social applications, delivered either on-premise or in the cloud. MySQL is increasingly used for Software-as-a-Service applications as well.

    It is also a popular choice as an embedded database with over 3,000 ISVs and OEMs using it.

    MySQL is also widely deployed for custom IT and departmental enterprise applications, where it is often complementary to Oracle Database deployments. It also represents a compelling alternative to Microsoft SQL Server with the ability to reduce database TCO by up to 90 percent.

    From a developer’s perspective, there is a need to address growing data volumes, and very high data ingestion and query speeds, while also allowing for flexibility in what data is captured. For this reason, the MySQL team works to deliver the best of the SQL and Non-SQL worlds to our users, including native, NoSQL access to MySQL storage engines, with benchmarks showing 9x higher INSERT rates than using SQL, while also supporting online DDL.

    At the same time, we do not sacrifice data integrity by trading away ACID compliance, and we do not trade away the ability to run complex SQL-based queries across the same data sets. This approach enables developers of new services to get the best out of MySQL database technologies.

    Q4. Playful Play, a Mexico-based company, is using MySQL Cluster Carrier Grade Edition (CGE) to support three million subscribers on Facebook in Latin America. What are the technical challenges they are facing for such project? How do they solve them?

    Tomas Ulin: As a start-up, Playful Play’s leading priority was fast time to market at the lowest possible cost. As a result, they developed the first release of the game on the LAMP stack.

    To meet both the scalability and availability requirements of the game, Playful Play initially deployed MySQL in a replicated, multi-master configuration.

    As Playful Play’s game, La Vecindad de El Chavo, spread virally across Facebook, subscriptions rapidly exceeded one million users, leading Playful Play to consider how best to architect their gaming platform for long-term growth.

    The database is core to the game, responsible for managing:
    • User profiles and avatars;
    • Gaming session data;
    • In-app (application) purchases;
    • Advertising and digital marketing event data.

    In addition to growing user volumes, the El Chavo game also added new features that changed the profile of the database. Operations became more write-intensive, with INSERTs and UPDATEs accounting for up to 70 percent of the database load.

    The game’s popularity also attracted advertisers, who demanded strict SLAs for both performance (predictable throughput with low latency) and uptime.

    After their evaluation, Playful Play decided MySQL Cluster was best suited to meet their needs for scalability and high availability.

    After their initial deployment, they engaged Oracle’s MySQL consulting services to help optimize query performance for their environment, and started to use MySQL Cluster Manager to manage their installation, including automating the scaling of their infrastructure to absorb 30,000 new users every day. With subscriptions to MySQL Cluster CGE, which combines Oracle Premier Support and MySQL Cluster Manager in an integrated offering, Playful Play also has access to qualified technical and consultative support, which is very important to them.

    Playful Play currently supports more than four million subscribers with MySQL.

    More details on their use of MySQL Cluster are here.

    Q5. What are the newest commercial extensions for MySQL Enterprise Edition? How do the commercial extensions differ from the standard open source features of MySQL?

    Tomas Ulin: MySQL Community Edition is available to all at no cost under the GPL. MySQL Enterprise Edition includes advanced features, management tools and technical support to help customers improve productivity and reduce the cost, risk and time to develop, deploy and manage MySQL applications. It also helps customers improve the performance, security and uptime of their MySQL-based applications.

    This link is to a short demo that illustrates the added value MySQL Enterprise Edition offers.

    Further details about all the commercial extensions can be found in this white paper. (Edit: You must be logged in to access this content.)

    The newest additions to MySQL Enterprise Edition, released last fall during MySQL Connect, include:

    * MySQL Enterprise Audit, to quickly and seamlessly add policy-based auditing for compliance to new and existing applications (a minimal sketch of enabling it follows this list).

    * Additional MySQL Enterprise High Availability options, including Distributed Replicated Block Device (DRBD) and Oracle Solaris Clustering, increasing the range of certified and supported HA options for MySQL.
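    As a rough illustration of how the audit plug-in is enabled on a commercial build that ships audit_log.so (connection details are placeholders; consult the Enterprise Audit documentation for the policy options):

    # Enable MySQL Enterprise Audit and inspect its settings.
    import mysql.connector

    cnx = mysql.connector.connect(host="db.example.com", user="root",
                                  password="secret")
    cur = cnx.cursor()
    cur.execute("INSTALL PLUGIN audit_log SONAME 'audit_log.so'")
    cur.execute("SHOW VARIABLES LIKE 'audit_log%'")
    for name, value in cur:
        print(name, value)
    cnx.close()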

    Q6. In September 2012, Oracle announced the first development milestone release of MySQL Cluster 7.3. What is new?

    Tomas Ulin: The release comprises:
    • Development Release 1: MySQL Cluster 7.3 with Foreign Keys. This has been one of the most requested enhancements to MySQL Cluster, enabling users to simplify their data models and application logic while extending the range of use cases (a minimal sketch follows this answer).
    • Early Access “Labs” Preview: MySQL Cluster NoSQL API for Node.js. Implemented as a module for the V8 engine, the new API gives Node.js a native, asynchronous JavaScript interface that can be used to query and receive result sets directly from MySQL Cluster, without translation to SQL. This lowers latency for simple queries while also allowing developers to build end-to-end JavaScript-based services, from the browser to the web/application layer through to the database, with less complexity.
    • Early Access “Labs” Preview: MySQL Cluster Auto-Installer. Implemented with a standard HTML GUI and Python-based web server back-end, the Auto-Installer intelligently configures MySQL Cluster based on application requirements and available hardware resources. This makes it simple for DevOps teams to quickly configure and provision highly optimized MySQL Cluster deployments – whether on-premise or in the cloud.

    Additional details can be found here.
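    To illustrate the foreign key support mentioned in the first item above, here is a minimal sketch using ordinary SQL against NDB tables; the connection parameters, schema, and table definitions are purely illustrative.

    # Foreign keys on MySQL Cluster (NDB) tables, new in Cluster 7.3.
    import mysql.connector

    cnx = mysql.connector.connect(host="mysqld-node.example.com", user="app",
                                  password="secret", database="game")
    cur = cnx.cursor()
    cur.execute("""
        CREATE TABLE players (
            id   INT NOT NULL PRIMARY KEY,
            name VARCHAR(64)
        ) ENGINE=NDBCLUSTER
    """)
    cur.execute("""
        CREATE TABLE purchases (
            id        INT NOT NULL PRIMARY KEY,
            player_id INT NOT NULL,
            item      VARCHAR(64),
            FOREIGN KEY (player_id) REFERENCES players (id) ON DELETE CASCADE
        ) ENGINE=NDBCLUSTER
    """)
    cnx.close()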

    Q7. John Busch, previously CTO of the former Schooner Information Technology, commented in a recent interview (1): “legacy MySQL does not scale well on a single node, which forces granular sharding and explicit application code changes to make them sharding-aware and results in low utilization of servers”. What is your take on this?

    Tomas Ulin: Improving scalability on a single node has been a significant area of development. MySQL 5.6 introduces a series of enhancements across the server, the optimizer, and the InnoDB storage engine, building further on earlier work. Benchmarks show close to linear scalability on systems with 48 cores/threads, with 230% higher performance than MySQL 5.5. Details are here.

    Q8. Could you please explain, in your opinion, the trade-off between scaling out and scaling up? What does it mean in practice when using MySQL?

    Tomas Ulin: On the scalability front, MySQL has come a long way in the last five years, and MySQL 5.6 makes huge improvements here; depending on your workload, it can scale well up to 32 or 48 cores. The old, proven techniques still work as well: you can use master-slave replication and split your load, sending writes to the master and reads to the slaves (a minimal sketch of this pattern follows the pros and cons below), or use sharding to partition data across multiple servers. The question is still the same: do you hit any bottlenecks on a single, big server? If so, you can start to think about which kind of distributed solution is most appropriate for you. And this is not only about the MySQL server; similar problems and solutions arise in any application that targets high-load activity today.

    Some things to consider…
    Scale-up pros:
    – easier management
    – easier to achieve consistency of your data

    Scale-up cons:
    – you need to dimension your server up front, or accept the cost of replacing your old hardware when you outgrow it
    – cost/performance is typically better on smaller servers
    – higher overall cost in the end
    – at some point you hit a ceiling: you can only scale up so far

    Scale-out pros:
    – you can start small with limited investment, and invest incrementally as you grow, reusing existing servers
    – you can choose hardware with the optimal cost/performance

    Scale-out cons:
    – more complicated management
    – you need to manage data and consistency across multiple servers, typically by making the application or middle tier aware of the data distribution and server roles in your scale-out (this applies to a traditional master-slave MySQL setup; choosing MySQL Cluster for scale-out avoids it)
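    To make the scale-out option concrete, here is a minimal read/write-split sketch for a traditional master-slave setup; host names, credentials, and the orders table are placeholders, and a production version would pool connections and account for replication lag.

    # Route writes to the replication master and reads to a slave.
    import mysql.connector

    master = mysql.connector.connect(host="db-master.example.com", user="app",
                                     password="secret", database="shop")
    slave = mysql.connector.connect(host="db-slave-1.example.com", user="app",
                                    password="secret", database="shop")

    def record_order(user_id, item):
        """Writes always go to the master."""
        cur = master.cursor()
        cur.execute("INSERT INTO orders (user_id, item) VALUES (%s, %s)",
                    (user_id, item))
        master.commit()

    def recent_orders(user_id):
        """Reads can be served by a (possibly slightly stale) slave."""
        cur = slave.cursor()
        cur.execute("SELECT item, created_at FROM orders "
                    "WHERE user_id = %s ORDER BY created_at DESC LIMIT 10",
                    (user_id,))
        return cur.fetchall()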

    Q9. How can you obtain scalability and high performance with Big Data and at the same time offer SQL joins?

    Tomas Ulin: Based on estimates from leading Hadoop vendors, around 80 percent of their deployments are integrated with MySQL.

    As discussed above, there has been significant development of NoSQL APIs to the InnoDB and MySQL Cluster storage engines, which allow high-speed ingestion of high-velocity key-value data while also allowing complex queries, including JOINs, to run across that same data set using SQL.
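    As a rough illustration of this “same data, two access paths” point, the sketch below ingests rows through the memcached plug-in and then joins the same InnoDB table from SQL. The container mapping (a scores table with a key column player and a value column points), host, and schema are assumptions, not a prescribed setup.

    # Fast key-value ingestion plus a SQL JOIN over the same InnoDB table.
    from pymemcache.client.base import Client
    import mysql.connector

    kv = Client(("mysql-host.example.com", 11211))
    kv.set("alice", "1200")   # fast path: no SQL parsing
    kv.set("bob", "950")
    # Depending on the plug-in's batch/commit settings, a short delay may
    # apply before these rows are visible to SQL.

    cnx = mysql.connector.connect(host="mysql-host.example.com", user="app",
                                  password="secret", database="game")
    cur = cnx.cursor()
    cur.execute("""
        SELECT p.country, s.player, s.points
        FROM scores AS s
        JOIN players AS p ON p.name = s.player
    """)
    print(cur.fetchall())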

    Technologies like Apache Sqoop are commonly used to load data between MySQL and Hadoop, so many users JOIN structured, relational data from MySQL with unstructured data such as clickstreams inside Map/Reduce jobs in Hadoop. We are also working on our Binlog API to enable real-time change data capture (CDC) with Hadoop. More on the Binlog API is here.

    Q10. Talking about scalability and performance, what are the main differences if the database is stored on hard drives, SAN, or flash memory (Flashcache)? What happens when data does not fit in DRAM?

    Tomas Ulin: The answer depends on your workload.
    The most important question is how long your active data set stays cached. If it does not stay cached for long, your workload remains I/O-bound and faster storage will help. If the active data set is small enough to stay cached most of the time, the impact of faster storage is much less noticeable. MySQL 5.6 delivers many changes that improve performance on heavily I/O-bound workloads. A rough way to check where a workload stands is sketched below.
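    The sketch compares logical buffer pool read requests with the reads that had to go to storage; the connection details are placeholders and the threshold is only a rule of thumb.

    # Estimate whether the active data set fits in the InnoDB buffer pool.
    import mysql.connector

    cnx = mysql.connector.connect(host="db.example.com", user="monitor",
                                  password="secret")
    cur = cnx.cursor()
    cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'")
    status = {name: int(value) for name, value in cur}
    cnx.close()

    requests = status["Innodb_buffer_pool_read_requests"]  # logical reads
    misses = status["Innodb_buffer_pool_reads"]            # reads from storage
    hit_ratio = 1 - misses / requests if requests else 0.0
    print(f"buffer pool hit ratio: {hit_ratio:.4%}")
    # A ratio well below ~99% on a steady workload suggests the working set
    # does not fit in memory and the system is effectively I/O-bound.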

    ————–
    Mr. Tomas Ulin has been working with the MySQL database team since 2003. He is Vice President of the MySQL Engineering team, responsible for the development and maintenance of the MySQL-related software products within Oracle, such as the MySQL Server, MySQL Cluster, MySQL Connectors, MySQL Workbench, MySQL Enterprise Backup, and MySQL Enterprise Monitor. Before working with MySQL, he worked in the telecom industry for the Swedish operator Telia and the telecom vendor Ericsson. He holds a Master’s degree in Computer Science and Applied Physics from Case Western Reserve University and a PhD in Computer Science from the Royal Institute of Technology.

    Related Posts

    (1): A super-set of MySQL for Big Data. Interview with John Busch, Schooner. February 20, 2012.

    (2): Scaling MySQL and MariaDB to TBs. Interview with Martín Farach-Colton. October 8, 2012.

    On Eventual Consistency. Interview with Monty Widenius. October 23, 2012.

    Resources

    ODBMS.org: Relational Databases, NewSQL, XML Databases, RDF Data Stores:
    Blog Posts | Free Software | Articles and Presentations | Lecture Notes | Tutorials | Journals

    Follow us on Twitter: @odbmsorg

    ##
