nosql databases – ODBMS Industry Watch

On Designing and Building Enterprise Knowledge Graphs. Interview with Ora Lassila and Juan Sequeda

Roberto V. Zicari — Tue, 02 Nov 2021 10:55:51 +0000

“The limits of my language mean the limits of my world.” – Ludvig Wittgenstein

I have interviewed Ora Lassila, Principal Graph Technologist in the Amazon Neptune team at AWS and Juan Sequeda, Principal Scientist at data.world. We talked about knowledge graphs and their new book.

RVZ

Q1. You wrote a book titled “Designing and Building Enterprise Knowledge Graphs”. What was the main motivation for writing such a book?

Ora Lassila and Juan Sequeda: We wanted to tackle the topic of knowledge graphs more broadly than just from the technology standpoint. There is more than just technology (e.g., graph databases) when it comes to successfully building a knowledge graph.

Time and time again we see people thinking about knowledge graphs and jumping to the conclusion that they just need a graph database and start there. Not only is there more technology you need, but there are issues with people, processes, organizations, etc.

Q2. What are knowledge graphs and what are they useful for?

Ora Lassila and Juan Sequeda: We see knowledge graphs as a vehicle for data integration and to make data accessible within an organization. Note that when we say “accessible data”, we really mean this: accessible data = physical bits + semantics. The semantics part is really important, since no data is truly accessible unless you also understand what the data means and how to interpret it. We call this issue the “knowledge/data gap”; Chapter 1 of our book gets deep into this.

You could say that knowledge graphs are a way to “democratize” data: make data more accessible and understandable to people who are not technology experts.

Q3. Why connecting relational databases with knowledge graphs?

Ora Lassila and Juan Sequeda: Frankly, the majority of enterprise data is in relational databases, so this seemed like a very good way to scope the problem. At the beginning of our book we show examples of how data is connected today and frankly, it’s a pain. And it’s not just a technical pain, there are important social and organizational aspects to this.

Juan Sequeda: Understanding the relationship between relational databases and the semantic web/knowledge graphs has been my quest since my undergraduate years. The title of my PhD dissertation is “Integrating Relational Databases with the Semantic Web”. Therefore I can say that this is a passion of mine.

Q4. Does it make more sense to use a native graph database instead or a NoSQL database?

Ora Lassila and Juan Sequeda: There is always the question “why use X instead of Y?”… and the answer almost always is “it depends”. We even bring this up in the foreword: As computer scientists we understand that there are many technologies that can be used to solve any particular problem. Some are easier, more convenient, and others are not. Just because you can write software in assembly language does not mean you shouldn’t seek to use a high-level programming language. Same with databases: find one that suits your purpose best.

Q5. What are the typical roles within an organization responsible for the knowledge graph?

Ora Lassila and Juan Sequeda: Organizations really need to get into the mindset of treating data as a product. When you acknowledge this, you realize you need the roles for designing, implementing and managing products, in this case data products. We see upcoming roles such as data product managers and knowledge scientists (i.e. Knowledge Engineers 2.0). We get into this in Chapter 4 of our book.

Q6. Data and knowledge are often in silos. Sharing knowledge and data is sometimes hard in an enterprise. What are the technical and non technical reasons for that?

Ora Lassila and Juan Sequeda: Technical problems are solvable, and many solutions exist. That said, we think knowledge graphs are really addressing this issue nicely.

The non-technical issues are an interesting challenge, and in many ways more difficult: people and process, organizational structure, centralization vs decentralization, etc. One specific issue that shows up all the time is this: If you want to share knowledge within a broader organization, you have to cross organizational boundaries, and that lands you on someone else’s “turf”. There is a great deal of diplomacy that is needed to tackle these kinds of issues.

Q7. When is it more appropriate to use RDF graph technologies instead of native property graph technologies?

Ora Lassila and Juan Sequeda: First, we object to the notion of “native” when it comes to property graphs, they are no more native than RDF graphs.

These are two slightly different approaches to building graphs. Ultimately, the question is not all that interesting. A more interesting question is: When should you use a graph as opposed to something else? If you do decide to use a graph, there are a lot of considerations and modeling decisions before you even come to the question of RDF vs. property graphs.

Of course, RDF is better suited to some situations (e.g., when you use external data, or have to merge graphs from different sources). Try using property graphs there and you merely end up re-inventing mechanisms that are already part of RDF. On the other hand, property graphs often appeal more to software developers, thanks to available access mechanisms and programming language support (e.g., Gremlin).

Q8. How can enterprises successfully adopt knowledge graphs to integrate data and knowledge, without boiling the ocean?

Ora Lassila and Juan Sequeda: First of all, you can’t build enterprise knowledge graphs in a “boil the ocean” approach. No chance in hell. You first need to break the problem in smaller pieces, by business units and use cases. This ultimately is a people and process problem. The tech is already here.

That said, there is a certain “build it and they will come” aspect to knowledge graphs. You should think of them more as a platform rather than as an application. Start by knowing some use cases, and gradually generalize and widen your scope. But you need to be solving some pressing problems for the business. Spend time understanding the problems, the limitations of their current solutions (assuming they are somewhat viable) and finding a champion (i.e. “if you can solve this problem better/faster/etc, I’m all ears!”). Also try to avoid educating on the technology: Business units don’t care if their problem is solved with technology A, B or C… all they want is for their problem to be solved.

Q9. Knowledge graphs and AI. Is there any relationships between them?

Ora Lassila and Juan Sequeda: Yes. Knowledge Graphs are a modern solution to a long-time (and in some ways, “ultimate”) goal in computer science: to integrate data and knowledge at scale. For at least the past half century, we’ve seen independent and integrated contributions coming from the AI community (namely knowledge representation, a subfield of classical AI) and the data management community. See section 1.3 of the book.

Qx Anything else you wish to add?

Ora Lassila and Juan Sequeda: We see a lot of what Albert Einstein gave as the definition of insanity: Doing the same thing over and over, and expecting different results. We need to do something truly different. But this is challenging for many reasons, not least because of this:

“The limits of my language mean the limits of my world.” – Ludvig Wittgenstein

For example, if SQL is your language, it may be very hard for you to see that there are some completely different ways of solving problems (case in point: graphs and graph databases).

Another challenge is that there are hard people and process issues, but as technologists we are wired to focus on technology, and to seek how to scale and automate.

Finally, we think the “graph industry” needs to evolve past the RDF vs. property graphs issue. Most people do not care. We need graphs. Period.

………………………………………..

Dr. Ora Lassila, Principal Graph Technologist in the Amazon Neptune team at AWS, mostly focusing on knowledge graphs. Earlier, he was a Managing Director at State Street, heading their efforts to adopt ontologies and graph databases. Before that, he worked as a technology architect at Pegasystems, as an architect and technology strategist at Nokia Location & Commerce (aka HERE), and prior to that he was a Research Fellow at the Nokia Research Center Cambridge. He was an elected member of the Advisory Board of the World Wide Web Consortium (W3C) in 1998-2013, and represented Nokia in the W3C Advisory Committee in 1998-2002. In 1996-1997 he was a Visiting Scientist at MIT Laboratory for Computer Science, working with W3C and launching the Resource Description Framework (RDF) standard; he served as a co-editor of the RDF Model and Syntax specification.

Juan Sequeda, Principal Scientist at data.world. He holds a PhD in Computer Science from The University of Texas at Austin. Juan’s goal is to reliably create knowledge from inscrutable data. His research and industry work has been on designing and building Knowledge Graph for enterprise data integration. Juan has researched and developed technology on semantic data virtualization, graph data modeling, schema mapping and data integration methodologies. He pioneered technology to construct knowledge graphs from relational databases, resulting in W3C standards, research awards, patents, software and his startup Capsenta (acquired by data.world). Juan strives to build bridges between academia and industry as the current co-chair of the LDBC Property Graph Schema Working Group, past member of the LDCB Graph Query Languages task force, standards editor at the World Wide Web Consortium (W3C) and organizing committees of scientific conferences, including being the general chair of The Web Conference 2023. Juan is also the co-host of Catalog and Cocktails, an honest, no-bs, non-salesy podcast about enterprise data.

Resources

Designing and Building Enterprise Knowledge Graphs Synthesis Lectures on Data, Semantics, and Knowledge August 2021, 165 pages, (https://doi.org/10.2200/S01105ED1V01Y202105DSK020) Juan Sequeda, data.world; Ora Lassila, Amazon

Follow us on Twitter: @odbmsorg

On InterSystems Technology Vision. Interview with Scott Gnau

Roberto V. Zicari — Thu, 20 Jun 2019 12:43:35 +0000

“At some point, most companies come to the realization that the advanced technologies and innovation that allow them to improve business operations also generate increased amounts of data that existing legacy technology is unable to handle, resulting in the need for more new technology. It is a cyclical process that CIOs need to prepare for.” –Scott Gnau

InterSystems has appointed last month Scott Gnau to Head of their Data Platforms Business Unit. I have asked Scott a number of questions related to data management, what are his advices for Chief Information Officers, what is the positing of the InterSystems IRIS family of data platforms, and what is the technology vision ahead for the company’s Data Platforms business unit.

RVZ

Q1. What are the main lessons you have learned in more than 20 years of experience in the data management space?

Scott Gnau: The data management space is a people-centric business, whether you are dealing with long-time customers or developers and architects. The formation of a trusted relationship can be the difference between a potential customer selecting one vendor’s technology which comes with the benefit of partnering for long term success, over a similar competitor’s technology.

Throughout my career, I have also learned how risky data management projects can be. They essentially ensure the security, cleanliness and accuracy of an organization’s data. They are then responsible for scaling data-centric applications, which helps inform important business decisions. Data management is a very competitive space which is only becoming more crowded.

Q2. What is your most important advice for Chief Information Officers?

Scott Gnau: At some point, most companies come to the realization that the advanced technologies and innovation that allow them to improve business operations also generate increased amounts of data that existing legacy technology is unable to handle, resulting in the need for more new technology. It is a cyclical process that CIOs need to prepare for.

Phenomena such as big data, the internet of things (IoT), and artificial intelligence (AI) are driving the need for this modern data architecture and processing, and CIOs should plan accordingly. For the last 30 years, data was primarily created inside data centers or firewalls, was standardized, kept in a central location and managed. It was fixed and simple to process.
In today’s world, most data is created outside the firewall and outside of your control. The data management process is now reversed – instead of starting with business requirements, then sourcing data and building and adjusting applications, developers and organizations load the data first and reverse engineer the process. Now data is driving decisions around what is relevant and informing the applications that are built.

Q3. How do you position the InterSystems IRIS family of data platforms with respect to other similar products on the market?

Scott Gnau: The data management industry is crowded, but the InterSystems IRIS data platform is like nothing else on the market. It has a unique, solid architecture that attracts very enthusiastic customers and partners, and plays well in the new data paradigm. There is no requirement to have a schema to leverage InterSystems IRIS. It scales unlike any other product in the data management marketplace.

InterSystems IRIS has unique architectural differences that enable all functions to run in a highly optimized fashion, whether it be supporting thousands of concurrent requests, automatic and easy compression, or highly performant data access methods.

Q4. What is your strategy with respect to the Cloud?

Scott Gnau: InterSystems has a cloud-first mentality, and with the goal of easy provisioning and elasticity, we offer customers the choice for cloud deployments. We want to make the consumption model simple, so that it is frictionless to do business with us.

InterSystems IRIS users have the ability to deploy across any cloud, public or private. Inside the software it leverages the cloud infrastructure to take advantage of the new capabilities that are enabled because of cloud and containerized architectures.

Q5. What about Artificial Intelligence?

Scott Gnau: AI is the next killer app for the new data paradigm. With AI, data can tell you things you didn’t already know. While many of the mathematical models that AI is built on are on the older side, it is still true that the more data you feed them the more accurate they become (which fits well with the new paradigm of data). Generating value from AI also implies real time decisioning, so in addition to more data, more compute and edge processing will define success.

Q6. How do you plan to help the company’s customers to a new era of digital transformation?

Scott Gnau: My goal is to help make technology as easy to consume as possible, to ensure that it is highly dependable. I will continue to work in and around vertical industries that are easily replicable.

Q7. What customers are asking for is not always what customers really need. How do you manage this challenge?

Scott Gnau: Disruption in the digital world is at an all-time high, and for some, impending change is sometimes too hard to see before it is too late. I encourage customers to be ready to “rethink normal,” while putting them in the best position for any transitions and opportunities to come. At the same time, as trusted partners we also are a source of advice to our customers on mega trends.

Q8. What is your technology vision ahead for the company’s Data Platforms business unit?

Scott Gnau: InterSystems continues to look for ways to differentiate how our technology creates success for our customers. We judge our success on our customers’ successes. Our unique architecture and overall performance envelope plays very well into data centric applications across multiple industries including financial services, logistics and healthcare. With connected devices and the requirement for augmented transactions we play nicely into the future high value application space.

Q9. What do you expect from your new role at InterSystems?

Scott Gnau: I expect to have a lot of fun because there is an infinite supply of opportunity in the data management space due to the new data paradigm and the demand for new analytics. On top of that, InterSystems has many smart, passionate and loyal customers, partners and employees. As I mentioned up front, it’s about a combination of great tech AND great people that drives success. Our ability to invest in the future is extremely strong – we have all the key ingredients.

————————————————-

Scott Gnau, Vice President, Data Platforms, InterSystems

Scott Gnau joined InterSystems in 2019 as Vice President of Data Platforms, overseeing the development, management, and sales of the InterSystems IRIS family of data platforms. Gnau brings more than 20 years of experience in the data management space helping lead technology and data architecture initiatives for enterprise-level organizations. He joins InterSystems from HortonWorks, where he served as chief technology officer. Prior to Hortonworks, Gnau spent two decades at Teradata in increasingly senior roles, including serving as president of Teradata Labs. Gnau holds a Bachelor’s degree in electrical engineering from Drexel University.

Resources

– InterSystems Appoints Scott Gnau to Head of Data Platforms Business Unit. CAMBRIDGE, Mass., May 6, 2019

– InterSystems IRIS

– InterSystems IRIS for Health

Related Posts

– On AI, Big Data, Healthcare in China. Q&A with Luciano Brustia ODBMS.org, 8 APR, 2019.

Follow us on Twitter: @odbmsorg

On Kubernetes. Interview with Eric Tune

Roberto V. Zicari — Mon, 18 Feb 2019 10:07:33 +0000

“Perhaps less obvious is how role definitions in an organization change as scale increases. Once rare tasks that were just a small part of one team’s responsibilities become so common that they are a full-time job for someone. At that point, one either needs to create automation for the task, or a new team needs to be assembled (or hired) to perform that task full time. ” — Eric Tune

I have interviewed Eric Tune, Senior Staff Engineer at Google. We talked about Kubernetes. Eric has been a Kubernetes contributor since 1.0

RVZ

Q1. What are the main technical challenges in implementing massive-scale environments?

Eric Tune: Whether working at small or massive scale, the high-level technical goals don’t change: security, developer velocity, efficiency in use of compute resources, supportability of production environments, and so on.

As scale increases, there are some fairly obvious discontinuities, like moving from an application that fits on a single-machine to one that spans multiple machines, and from a single data center or zone to multiple regions. Quite a bit has been written about this. Microservices in particular can be a good fit because they scale well to more machines and more regions.

Perhaps less obvious is how role definitions in an organization change as scale increases. Once rare tasks that were just a small part of one team’s responsibilities become so common that they are a full-time job for someone. At that point, one either needs to create automation for the task, or a new team needs to be assembled (or hired) to perform that task full time. Sometimes, it is obvious how to do this. But, when this repeats many times, one can end up with a confusing mess of automation and tickets, dragging down development velocity and confounding attempts to analyze security and debug systemic failure.

So, a key challenge is finding the right separation responsibilities so that multiple pieces of automation, and multiple human teams collaborate well. Doing it requires not only having a broad view of an organization’s current processes and responsibilities around development, operations, and security; but also which assumptions behind those are no longer valid.

Kubernetes can help hereby providing automation for exactly the types of tasks that become toilsome as scale increases. Several of its creators have lived through organic growth to a massive-scale. Kubernetes is built from that experience, with awareness of the new roles that are needed at massive-scale.

Q2. What is Kubernetes and why is it important?

Eric Tune: First, Kubernetes is one of the most popular ways to deploy applications in containers. Containers make the act of maintaining the machine & operating system a largely separate process from installing and maintaining an application instance – no more worrying about shared library or system utility version differences.

Second, it provides an abstraction over IaaS: VMs, VM images, VM types, load balancers, block storage, auto-scalers, etc. Kubernetes runs on numerous clouds, on-premises, and on a laptop. Many complex applications, such as those consisting of many microservices, can be deployed onto any Kubernetes cluster regardless of the underlying infrastructure. For an organization that may want to modernize their applications now, and move to cloud later, targeting Kubernetes means they won’t need to re-architect when they are ready to move. Third, Kubernetes supports infrastructure-as-code (IaC). You can define complex applications, including storage, networking, and application identity, in a common configuration language, called the Kubernetes Resource model. Unlike other IaC systems, which mostly support a “single-user” model, Kubernetes is designed for multiple users. It supports controlled delegation of responsibility from an ops team to a dev team.

Fourth, it provides an opinionated way to build distributed system control planes, and to extend the APIs and infrastructure-as code type system. This allows solution vendors and in-house infrastructure teams to build custom solutions that feel like they are first class parts of Kubernetes.

Q3. Who should be using Kubernetes?

Eric Tune: If your organization runs Linux-based microservices and has explored container technology, then you are ready to try Kubernetes.

Q4. You are a Kubernetes contributor since 1.0 (4 years). What did you work on specifically?

Eric Tune: During the first year, I worked on whatever needed to be done, including security (namespaces, service accounts, authentication and authorization, resource quota), performance, documentation, testing, API review and code review.

In those first years, people were mostly running stateless microservices on Kubernetes. In the second year, I worked to broaden the set of applications that can run on Kubernetes. I worked on the Job and CronJob APIs of Kubernetes, which support basic batch computation, and the StatefulSet API, which supports databases and other stateful applications. Additionally, I worked with the Helm project on Charts (easy-to-install applications for Kubernetes), with the Spark open source community to get it running on Kubernetes.

Starting in 2017, Kubernetes interest was growing so quickly that the project maintainers could not accept a fraction of the new features that were proposed. The answer was to make Kubernetes extensible so that new features could be build “out of the core.” I worked to define the extensibility story for Kubernetes, particularly for Custom Resource Definitions (CRDs) and Webhooks. The extensibility features of Kubernetes have enabled other large projects, such as Istio and Knative, to integrate with Kubernetes with lower overhead for the Kubernetes project maintainers.

Currently, I lead teams which work on both Open Source Kubernetes and Google Cloud.

Q5. What are the main challenges of migrating several microservices to Kubernetes?

Eric Tune: Here are three challenges I see when migrating several microservices to Kubernetes, and how I recommend handling them:

Remove Ordering Dependencies: Say microservice C depends on microservices A and B to function normally. When migrating to declarative configuration and Kubernetes, the startup order for microservices can become variable, where previously it was ordered (e.g. by a script). This can cause unexpected behaviors. For example, microservice C might log errors at a high rate or crash if A is not ready yet. A first reaction is sometimes “how can I guarantee ordering of microservice startup,” My advice is not to impose order, but to change problematic behavior. For example, C could be changed to return some response for a request even when A and B are unreachable. This is not really a Kubernetes-specific requirement – it is a good practice for microservices, as it allows for graceful recovery from failures and for autoscaling.
Don’t Persist Peer Network Identity: Some microservices permanently record the IP addresses of their peers at startup time, and then don’t expect it to ever change. That’s not a great match for the Kubernetes approach to networking. Instead, resolve peer addresses using their domain names and re-resolve after disconnection.
Plan ahead for Running in Parallel: When migrating a complex set of microservices to Kubernetes, it’s typical to run the entire old environment and the new (Kubernetes) environment in parallel. Make sure you have load replay and response diffing tools to evaluate a dual environment setup.

Q6. How can Kubernetes scale without increasing ops team?

Eric Tune: Kubernetes is built to respond to many types of application and infrastructure failures automatically – for example slow memory leaks in an application, or kernel panics in a virtual machine. Previously this kind of problem may have required immediate attention. With Kubernetes as the first line of defense, ops can wait for more data before taking action. This in turn supports faster rollouts, as you don’t need to soak as long if you know that slow memory leaks will be handled automatically, and you can fix by rolling forward rather than back.

Some ops teams also face multiple deployment environments, including multi-cloud, hybrid, or varying hardware in on-premises datacenters. Kubernetes hides somes differences between these, reducing the number of variations of configuration that is needed.

A pattern I have seen is role specialization within ops teams, which can bring efficiencies. Some members specialize in operating the Kubernetes cluster itself, what I call a “Cluster Operations” role, while others specialize in operating a set of applications (microservices). The clean separation between infrastructure and application – in particular the use of Kubernetes configuration files as a contract between the two groups – supports this separation of duties.

Finally, if you are able to choose a hosted version of Kubernetes such as Google Container Engine (GKE), then the hosted service takes on much of the Cluster Operations role. (Note: I work on GKE.)

Q7. On-premises, hybrid, or public cloud infrastructure: which solutions would you think is it better for running Kubernetes?

Eric Tune: Usually factors unrelated to Kubernetes will determine if an application needs to run on-premises, such as data sovereignty, latency concerns or an existing hardware investment. Often some applications need to be on-premises and some can move to public cloud. In this case you have a hybrid Kubernetes deployment, with one or more clusters on-premises, and one or more clusters on public cloud. For application operators and developers, the same tools can be used in all the clusters. Applications in different clusters can be configured to communicate with each other, or to be separate, as security needs dictate. Each cluster is a separate failure domain. One does not typically have a single cluster which spans on-premises and public cloud.

Q8. Kubernetes is open source. How can developers contribute?

Eric Tune: We have 150+ associated repositories that are all looking for developers (and other roles) to contribute. If you want to help but aren’t sure what you want to work on, then start with the Community ReadMe, and come to the community meetings or watch a rerun. If you think you already know what area of Kubernetes you are interested in, then start with our contributors guide, and attend the relevant Special Interest Group (SIG) meeting.

—————————–

Dr. Eric Tune is a Senior Staff Engineer at Google. He leads dozens of engineers working on Kubernetes and GKE. He has been a Kubernetes contributor since 1.0. Previously at Google he worked on the Borg container orchestration system, drove company-wide compute efficiency improvements, created the Google-wide Profiling system, and helped expand the size of Google’s search index. Prior to Google, he was active in computer architecture research. He holds computer engineering degrees (PhD, MS, BS) from UCSD .

On Using Graph Database technology at Behance. Interview with David Fox

Roberto V. Zicari — Mon, 16 Jul 2018 12:39:37 +0000

“We’ve corrected a major human resource burden that has led to an exponential decrease in the amount of developer-operations staff hours required each month to keep our activity feed running. This means they can focus on other areas of our infrastructure that need attention, versus spending time frustratingly micro-managing the activity infrastructure. ” –David Fox

I have interviewed David Fox. David is a Software Engineer at Adobe, responsible for backend infrastructure and performance of Behance – a social network for creatives, serving over 10 million members. Behance is owned by Adobe.

RVZ

Q1. Who is using Behance and for what? How many users do you have, and what data volume and data types do you handle?

David Fox: Behance is the leading online platform to showcase and discover creative work. Every single day, hundreds of thousands of creative individuals update and publish new projects to the Behance platform allowing for advanced collaboration and an efficient means to solicit helpful feedback on both final work and ‘works in progress.’
Not only does Behance offer an outlet for peer-to-peer sharing among creatives, it’s a top destination for companies looking to hire creative talent on a global scale. With over 10 million members, our ‘activity feed’ infrastructure serves millions of feed loads daily.

Q2. What specific data requirements do you have from your users?

David Fox: Our users expect the ability to access current and past projects within Behance that’s both relevant to their core interest areas while also offering discovery tools/functionality. The ‘Activity Feed Feature’ allows users to follow their favorite creatives and curate galleries based on preferences. When a user follows a creative, they receive alerts and an updated feed whenever that creative logs an activity within Behance, e.g. when they ‘appreciate’ a project, publish a project, comment on a project. Users can also follow pre-curated galleries, selected by the Behance curation team, which highlight a creative theme, e.g. graphic design, photography, illustration.

Q3. You have implemented an activity feed for many years. What challenges did you have with the previous implementation?

David Fox: Our previous implementation of ‘Activity Feed’ was built with Cassandra. It was designed for optimized ‘reads’ (when we showed a user their activity feed), but consequently, it was very storage/write-heavy.
We needed 48 large Cassandra instances to power the feature, and even with those, there were still many limitations and bottlenecks on the system. We had to devote a large amount of app resources to populate Cassandra with activity feed data. For every action a user took on our app, we would use a “fanout” method to populate the activity feed of every user that followed them with the new activity item. For users who were followed by thousands of people, resource utilization skyrocketed every time they took an action. Our application worker processes that processed those items would experience delays in their work because of all the work they needed to do to populate the activity feeds in Cassandra.

We also had a lot of challenges maintaining our Cassandra cluster, which led to us having to devote a significant amount of ops/developer time to supporting the cluster. With our schema format, we didn’t have an efficient way to delete feed items from our database, so the disk usage began to add up quickly on each node. When disk usage became high, we would have to perform maintenance tasks to stabilize the cluster. The rigid schema structure also meant we couldn’t easily make any improvements to our activity feed feature, and we were just working to keep it running with little hope of improving it.

Q4. What are the goals you set for the new implementation of the activity feed? How does the activity feed relate to the rest of the data platform?

David Fox: We had several goals for our new activity feed implementation.

Ensure the new infrastructure significantly reduced human-maintenance costs and required minimal effort and resources to keep running.
Reduce the complexity of the system as a whole.
Significantly improve the performance of writes while keeping reads fast.
Increase the flexibility of the system, in turn making it easier to add new features.
Reduce data storage size.

Q5. Can you share with us some details of the new implementation?

David Fox: Our new activity feed implementation, built on top of Neo4j, uses a simple graph model where we store relationships between users and the entities they follow. We then store simple “action” relationships that represent a user or curated gallery taking an action on a project (ex. commenting, publishing, appreciating, etc.). This data model produces very little repeating data so data maintenance is simple for our application layer, doesn’t use a lot of resources, and provides good flexibility on how we can query the data.

Q6. Why was removing activity items when you unfollowed a user “impossible” with the previous implementation? How is it done now? What about scalability?

David Fox: In our previous implementation, our Cassandra schema didn’t allow for us to efficiently delete data at any scale since it was highly optimized for reads, and there was significant data repetition. So, when a user unfollowed another user, since their activity feed was pre-written to Cassandra in-full, they would still see items from users they unfollowed. Now, we simply delete the “follows” relationship from the user to the person they unfollowed, instantly reflecting the change in their feed. This instant change is possible since we don’t store any data about individual activity feeds, and instead, only store data modeling on who users follow and actions taken on projects.

Q7. In the previous implementation you could only “backfill” 30 items per category during on boarding. Why was this a problem? How is it done now?

David Fox: This limitation was a significant struggle for us. When a user creates an account with Behance, they select one or more curated categories they want to follow. At that point, with our old implementation, we would have to backfill their feed with activity items since all the items we could show them needed to be stored in Cassandra – one row for each project in their feed. So, when a user signed up, they would need to wait the amount of time it took for us to fill Cassandra with activity items for the curated categories they selected. Because of that delay, we minimized the number of projects we backfilled to make sure there wasn’t too long of a delay (only a few seconds) between finishing sign-up and seeing the initial Behance experience. Unfortunately, limiting the backfill to 30 items meant after sign-up, a user could only scroll through 30 x (number of categories selected) items on their feed, and then they’d reach the end.

Now, with our new activity feed implementation, there is no backfill required at all. When a user creates an account and selects the curated categories they want to follow, all we have to do is create the “follows” relationships in Neo4j between the user and the categories, and their feed will instantly be available. Since we store about 1,000 latest items for each curated category in Neo4j, that means that users can now almost instantly see 1,000 x (number of categories selected) items in their feed. That’s a much better initial experience on our application!

Q8. How did you implement the ability to add new features like “newest projects from your network” ?

David Fox: The flexibility of the graph data model allowed us to easily add this new activity view to the activity feed feature. Instead of having to load a user’s activity based on a pre-existing stored structure, we were able to simply query by action type in Neo4j to create a specific view of the activity feed, analyzing only “published project” actions instead of all our usual activity actions.

Q9. Why is selectively writing for users with lots of followers no longer needed?

David Fox: Previously, if a user was followed by a significant amount of people, we were unable to share an activity update to the user’s full group of followers due to the ‘fan out’ action for each follower. Now, the number of people following a user has no impact on the time/complexity for creating the action since the same action relationship is shared by all followers of that user – meaning very little repeating data.

Q10. Can you share any numbers around the performance of the new implementation?

David Fox: We’ve seen some significant performance improvements with the new implementation. We’ve been able to cut the time from sign-up to initial activity experience from 1.4 seconds to 400 milliseconds on average. In addition to the speed improvement, users’ initial curated feed is now much more complete.

We’ve seen the most significant improvements in writing activity data. For example, one of our write job processes (when a project was featured in a curated gallery) used to take 12 minutes on average to run and consumed significant application resources. Now, on average, that write operation takes 106 milliseconds.

Q11. What about COGS, Storage Costs and Complexity?

David Fox: Our Neo4j activity implementation has led to a great decrease in complexity, storage, and infrastructure costs. Our full dataset size is now around 40 GB, down from 50 TB of data that we had stored in Cassandra. We’re able to power our entire activity feed infrastructure using a cluster of 3 Neo4j instances, down from 48 Cassandra instances of pretty much equal specs. That has also led to reduced infrastructure costs. Most importantly, it’s been a breeze for our operations staff to manage since the architecture is simple and lean.

Q12. What is the road map ahead for data projects at Behance?

David Fox: We want to continue to take advantage of our new, flexible graph infrastructure to improve our activity feed feature and make it more useful. We recently built “newest projects from your network” view for activity and we will be launching it soon. This feature is a great example of our ability to move quickly and nimbly, and we’ll continue to add more compelling features to our activity experience moving forward.

Just as important as new features, we’ve corrected a major human resource burden that has led to an exponential decrease in the amount of developer-operations staff hours required each month to keep our activity feed running. This means they can focus on other areas of our infrastructure that need attention, versus spending time frustratingly micro-managing the activity infrastructure.

———————————-

David Fox is an application developer/data engineer specializing in the development of high-performance backend systems and working with a large variety of databases alongside massive datasets. He is a Software Engineer at Adobe, responsible for backend infrastructure and performance of Behance – the premier social network for creatives, serving over 10 million members.

Resources

GRAPH DATA STORES – Free Downloads for:

Blog Posts

Free Software

Articles, Papers, Presentations

Tutorials, Lecture Notes

PhD and Master Thesis

Related Posts

– On RDBMS, NoSQL and NewSQL databases. Interview with John Ryan , ODBMS Industry Watch, March 9, 2018

– Identity Graph Analysis at Scale. Interview with Niels Meersschaert, ODBMS Industry Watch, May 9, 2017

Follow us on Twitter: @odbmsorg

On the future of Data Warehousing. Interview with Jacque Istok and Mike Waas

Roberto V. Zicari — Thu, 09 Nov 2017 08:54:27 +0000

” Open source software comes with a promise, and that promise is not about looking at the code, rather it’s about avoiding vendor lock-in.” –Jacque Istok.

” The cloud has out-paced the data center by far and we should expect to see the entire database market being replatformed into the cloud within the next 5-10 years.” –Mike Waas.

I have interviewed Jacque Istok, Head of Data Technical Field for Pivotal, and Mike Waas, founder and CEO Datometry.
Main topics of the interview are: the future of Data Warehousing, how are open source and the Cloud affecting the Data Warehouse market, and Datometry Hyper-Q and Pivotal Greenplum.

RVZ

Q1. What is the future of Data Warehouses?

Jacque Istok: I believe that what we’re seeing in the market is a slight course correct with regards to the traditional data warehouse. For 25 years many of us spent many cycles building the traditional data warehouse.
The single source of the truth. But the long duration it took to get alignment from each of the business units regarding how the data related to each other combined with the cost of the hardware and software of the platforms we built it upon left everybody looking for something new. Enter Hadoop and suddenly the world found out that we could split up data on commodity servers and, with the right human talent, could move the ball forward faster and cheaper. Unfortunately the right human talent has proved hard to come by and the plethora of projects that have spawned up are neither production ready nor completely compliant or compatible with the expensive tools they were trying to replace.
So what looks to be happening is the world is looking for the features of yesterday combined with the cost and flexibility of today. In many cases that will be a hybrid solution of many different projects/platforms/applications, or at the very least, something that can interface easily and efficiently with many different projects/platforms/applications.

Mike Waas: Indeed, flexibility is what most enterprises are looking for nowadays when it comes to data warehousing. The business needs to be able to tap data quickly and effectively. However, in today’s world we see an enormous access problem with application stacks that are tightly bonded with the underlying database infrastructure. Instead of maintaining large and carefully curated data silos, data warehousing in the next decade will be all about using analytical applications from a quickly evolving application ecosystem with any and all data sources in the enterprise: in short, any application on any database. I believe data warehouses remain the most valuable of databases, therefore, cracking the access problem there will be hugely important from an economic point of view.

Q2. How is open source affecting the Data Warehouse market?

Jacque Istok: The traditional data warehouse market is having its lunch eaten by open source. Whether it’s one of the Hadoop distributions, one of the up and coming new NoSQL engines, or companies like Pivotal making large bets and open source production proven alternatives like Greenplum. What I ask prospective customers is if they were starting a new organization today, what platforms, databases, or languages would you choose that weren’t open source? The answer is almost always none. Open source software comes with a promise, and that promise is not about looking at the code, rather it’s about avoiding vendor lock-in.

Mike Waas: Whenever a technology stack gets disrupted by open source, it’s usually a sign that the technology has reached a certain maturity and customers have begun doubting the advantage of proprietary solutions. For the longest time, analytical processing was considered too advanced and too far-reaching in scope for an open source project. Greenplum Database is a great example for breaking through this ceiling: it’s the first open source database system with a query optimizer not only worth that title but setting a new standard, and a whole array of other goodies previously only available in proprietary systems.

Q3. Are databases an obstacle to adopting Cloud-Native Technology?

Jacque Istok: I believe quite the contrary, databases are a requirement for Cloud-Native Technology. Any applications that are created need to leverage data in some way. I think where the technology is going is to make it easier for developers to leverage whichever database or datastore makes the most sense for them or they have the most experience with – essentially leveraging the right tool for the right job, instead of the tool “blessed” by IT or Operations for general use. And they are doing this by automating the day 0, day 1, and day 2 operations of those databases. Making it easy to instantiate and use these platforms for anyone, which has never really been the case.

Mike Waas: In fact, a cloud-first strategy is incomplete unless it includes the data assets, i.e., the databases. Now, databases have always been one of the hardest things to move or replatform, and, naturally, it’s the ultimate challenge when moving to the cloud: firing up any new instance in the cloud is easy as 1-2-3 but what to do with the 10s of years of investment in application development? I would say it’s actually not the database that’s the obstacle but the applications and their dependencies.

Q4. What are the pros and cons of moving enterprise data to the cloud?

Jacque Istok: I think there are plenty of pros to moving enterprise data to the cloud, the extent of that list will really depend on the enterprise you’re talking to and the vertical that they are in. But cons? The only cons would be using these incredible tools incorrectly, at which point you might find yourself spending more money and feeling that things are slower or less flexible. Treating the cloud as a virtual data center, and simply moving things there without changing how they are architected or how they are used would be akin to taking

Mike Waas: I second that. A few years ago enterprises were still concerned about security, completeness of offering, and maturity of the stack. But now, the cloud has out-paced the data center by far and we should expect to see the entire database market being replatformed into the cloud within the next 5-10 years. This is going to be the biggest revolution in the database industry since the relational model with great opportunities for vendors and customers alike.

Q5. How do you quantify when is appropriate for an enterprise to move their data management to a new platform?

Jacque Istok: It’s pretty easy from my perspective, when any enterprise is done spending exorbitant amounts of money it might be time to move to a new platform. When you are coming up on a renewal or an upgrade of a legacy and/or expensive system it might be time to move to a new platform. When you have new initiatives to start it might be time to move to a new platform. When you are ready to compete with your competitors, both known and unknown (aka startups), it might be time to move to a new platform. The move doesn’t have to be scary either, as some products are designed to be a bridge to a modern a data platform.

Mike Waas: Traditionally, enterprises have held off from replatforming for too long: the switching cost has deterred them from adopting new and highly superior technology with the result that they have been unable to cut costs or gain true competitive advantage. Staying on an old platform is simply bad for business. Every organization needs to ask themselves constantly the question whether their business can benefit from adopting new technology. At Datometry, we make it easy for enterprises to move their analytics — so easy, in fact, the standard reaction to our technology is, “this is too good to be true.”

Q6. What is the biggest problem when enterprises want to move part or all of their data management to the cloud?

Jacque Istok: I think the biggest problem tends to be not architecting for the cloud itself, but instead treating the cloud like their virtual data center. Leveraging the same techniques, the same processes, and the same architectures will not lead to the cost or scalability efficiencies that you were hoping for.

Mike Waas: As Jacque points out, you really need to change your approach. However, the temptation is to use the move to the cloud as a trigger event to rework everything else at the same time. This quickly leads to projects that spiral out of control, run long, go over budget, or fail altogether. Being able to replatform quickly and separate the housekeeping from the actual move is, therefore, critical.
However, when it comes to databases, trouble runs deeper as applications and their dependencies on specific databases are the biggest obstacle. SQL code is embedded in thousands of applications and, probably most surprising, even third-party products that promise portability between databases get naturally contaminated with system-specific configuration and SQL extensions. We see roughly 90% of third-party systems (ETL, BI tools, and so forth) having been so customized to the underlying database that moving them to a different system requires substantial effort, time, and money.

Q7. How does an enterprise move the data management to a new platform without having to re-write all of the applications that rely on the database?

Mike Waas: At Datometry, we looked very carefully at this problem and, with what I said above, identified the need to rewrite applications each time new technology is adopted as the number one problem in the modern enterprise. Using Adaptive Data Virtualization (ADV) technology, this will quickly become a problem of the past! Systems like Datometry Hyper-Q let existing applications run natively and instantly on a new database without requiring any changes to the application. What would otherwise be a multi-year migration project and run into the millions, is now reduced in time, cost, and risk to a fraction of the conventional approach. “VMware for databases” is a great mental model that has worked really well for our customers.

Q8. What is Adaptive Data Virtualization technology, and how can it help adopting Cloud-Native Technology?

Mike Waas: Adaptive Data Virtualization is the simple, yet incredibly powerful, abstraction of a database: by intercepting the communication between application and database, ADV is able to translate in real-time and dynamically between the existing application and the new database. With ADV, we are drawing on decades of database research and solving what is essentially a compatibility problem between programming languages and systems with an elegant and highly effective approach. This is a space that has traditionally been served by consultants and manual migrations which are incredibly labor-intensive and expensive undertaking.
Through ADV, adopting cloud technology becomes orders of magnitude simpler as it takes away the compatibility challenges that hamper any replatforming initiative.

Q9. Can you quantify what are the reduced time, cost, and risk when virtualizing the data warehouse?

Jacque Istok: In the past, virtualizing the data warehouse meant sacrificing performance in order to get some of the common benefits of virtualization (reduced time for experimentation, maximizing resources, relative ease to readjust the architecture, etc). What we have found recently is that virtualization, when done correctly, actually provides no sacrifices in terms of performance, and the only question becomes whether or not the capital cost expenditure of bare metal versus the opex cost structure of virtual is something that makes sense for your organisation.

Mike Waas: I’d like to take it a step further and include ADV into this context too: instead of a 3-5 year migration, employing 100+ consultants, and rewriting millions of lines of application code, ADV lets you leverage new technology in weeks, with no re-writing of applications. Our customers can expect to save at least 85% of the transition cost.

Q10. What is the massively parallel processing (MPP) Scatter/Gather Streaming technology, and what is it useful for?

Jacque Istok: This is arguably one of the most powerful features of Pivotal Greenplum and it allows for the fastest loading of data in the industry. Effectively we scatter data into the Greenplum data cluster as fast as possible with no care in the world to where it will ultimately end up. Terabytes of data per hour, basically as much as you can feed down the wires, is sent to each of the workers within the cluster. The data is therefore disseminated to the cluster in the fastest physical way possible. At that point, each of the workers gathers the data that is pertinent to them according to the architecture you have chosen for the layout of those particular data elements, allowing for a physical optimization to be leveraged during interrogation of the data after it has been loaded.

Q11. How Datometry Hyper-Q & Pivotal Greenplum data warehouse work together?

Jacque Istok: Pivotal Greenplum is the world’s only true open source, production proven MPP data platform that provides out of the box ANSI compliant SQL capabilities along with Machine Learning, AI, Graph, Text, and Spatial analytics all in one. When combined with Datometry Hyper-Q, you can transparently and seamlessly take any Teradata application and, without changing a single line of code or a single piece of SQL, run it and stop paying the outrageous Teradata tax that you have been bearing all this time. Once you’re able to take out your legacy and expensive Teradata system, without a long investment to rewrite anything, you’ll be able to leverage this software platform to really start to analyze the data you have. And that analysis can be either on premise or in the cloud, giving you a truly hybrid and cross-cloud proven platform.

Mike Waas: I’d like to share a use case featuring Datometry Hyper-Q and Pivotal Greenplum featuring a Fortune 100 Global Financial Institution needing to scale their business intelligence application, built using 2000-plus stored procedures. The customer’s analysis showed that replacing their existing data warehouse footprint was prohibitively expensive and rewriting the business applications to a more cost-effective and modern data warehouse posed significant expense and business risk. Hyper-Q allowed the customer to transfer the stored procedures in days without refactoring the logic of the application and implement various control-flow primitives, a time-consuming and expensive proposition.

Qx. Anything else you wish to add?

Jacque Istok: Thank you for the opportunity to speak with you. We have found that there has never been a more valid time than right now for customers to stop paying their heavy Teradata tax and the combination of Pivotal Greenplum and Datometry Hyper-Q allows them to do that right now, with no risk, and immediate ROI. On top of that, they are then able to find themselves on a modern data platform – one that allows them to grow into more advanced features as they are able. Pivotal Greenplum becomes their bridge to transforming your organization by offering the advanced analytics you need but giving you traditional, production proven capabilities immediately. At the end of the day, there isn’t a single Teradata customer that I’ve spoken to that doesn’t want Teradata-like capabilities at Hadoop-like prices and you get all this and more with Pivotal Greenplum.

Mike Waas: Thank you for this great opportunity to speak with you. We, at Datometry, believe that data is the key that will unlock competitive advantage for enterprises and without adopting modern data management technologies, it is not possible to unlock value. According to the leading industry group, TDWI, “today’s consensus says that the primary path to big data’s business value is through the use of so-called ‘advanced’ forms of analytics based on technologies for mining, predictions, statistics, and natural language processing (NLP). Each analytic technology has unique data requirements, and DWs must modernize to satisfy all of them.”
We believe virtualizing the data warehouse is the cornerstone of any cloud-first strategy because data warehouse migration is one of the most risk-laden and most expensive initiatives that a company can embark on during their journey to to the cloud.
Interestingly, the cost of migration is primarily the cost of process and not technology and this is where Datometry comes in with its data warehouse virtualization technology.
We are the key that unlocks the power of new technology for enterprises to take advantage of the latest technology and gain competitive advantage.

———————

Jacque Istok serves as the Head of Data Technical Field for Pivotal, responsible for setting both data strategy and execution of pre and post sales activities for data engineering and data science. Prior to that, he was Field CTO helping customers architect and understand how the entire Pivotal portfolio could be leveraged appropriately.
A hands on technologist, Mr. Istok has been implementing and advising customers in the architecture of big data applications and back end infrastructure the majority of his career.

Prior to Pivotal, Mr. Istok co-founded Professional Innovations, Inc. in 1999, a leading consulting services provider in the business intelligence, data warehousing, and enterprise performance management space, and served as its President and Chairman. Mr. Istok is on the board of several emerging startup companies and serves as their strategic technical advisor.

Mike Waas, CEO Datometry, Inc.
Mike Waas founded Datometry after having spent over 20 years in database research and commercial database development. Prior to Datometry, Mike was Sr. Director of Engineering at Pivotal, heading up Greenplum’s Advanced R&Dteam. He is also the founder and architect of Greenplum’s ORCA query optimizer initiative. Mike has held senior engineering positions at Microsoft, Amazon, Greenplum, EMC, and Pivotal, and was a researcher at Centrum voor Wiskunde en Informatica (CWI), Netherlands, and at Humboldt University, Berlin.

Mike received his M.S. in Computer Science from University of Passau, Germany, and his Ph.D. in Computer Science from the University of Amsterdam, Netherlands. He has authored or co-authored 36 publications on the science of databases and has 24 patents to his credit.

Resources

–Datometry Releases Hyper-Q Data Warehouse Virtualization Software Version 3.0. AUGUST 11, 2017

–Replatforming Custom Business Intelligence | Use Case, ODBMS.org, NOVEMBER 7, 2017

– Disaster Recovery Cloud Data Warehouse | Use Case. ODBMS.org, NOVEMBER 3, 2017

– Scaling Business Intelligence in the Cloud | Use Case. ODBMS.org · NOVEMBER 3, 2017

– Re-Platforming Data Warehouses – Without Costly Migration Of Applications. ODBMS.org · NOVEMBER 3, 2017

– Meet Greenplum 5: The World’s First Open-Source, Multi-Cloud Data Platform Built for Advanced Analytics. ODBMS.org · SEPTEMBER 21, 2017

Related Posts

– On Open Source Databases. Interview with Peter Zaitsev, ODBMS Industry Watch, Published on 2017-09-06

– On Apache Ignite, Apache Spark and MySQL. Interview with Nikita Ivanov , ODBMS Industry Watch, Published on 2017-06-30

– On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah, ODBMS Industry Watch, Published on 2017-03-13

Follow us on Twitter: @odbmsorg

Database Challenges and Innovations. Interview with Jim Starkey

Roberto V. Zicari — Wed, 31 Aug 2016 03:33:42 +0000

“Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size?.–Jim Starkey.

I have interviewed Jim Starkey. A database legend, Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history.

RVZ

Q1. In your opinion, what are the most significant advances in databases in the last few years?

Jim Starkey: I’d have to say the “atom programming model” where a database is layered on a substrate of peer-to-peer replicating distributed objects rather than disk files. The atom programming model enables scalability, redundancy, high availability, and distribution not available in traditional, disk-based database architectures.

Q2. What was your original motivation to invent the NuoDB Emergent Architecture?

Jim Starkey: It all grew out of a long Sunday morning shower. I knew that the performance limits of single-computer database systems were in sight, so distributing the load was the only possible solution, but existing distributed systems required that a new node copy a complete database or partition before it could do useful work. I started thinking of ways to attack this problem and came up with the idea of peer to peer replicating distributed objects that could be serialized for network delivery and persisted to disk. It was a pretty neat idea. I came out much later with the core architecture nearly complete and very wrinkled (we have an awesome domestic hot water system).

Q3. In your career as an entrepreneur and architect what was the most significant innovation you did?

Jim Starkey: Oh, clearly multi-generational concurrency control (MVCC). The problem I was trying to solve was allowing ad hoc access to a production database for a 4GL product I was working on at the time, but the ramifications go far beyond that. MVCC is the core technology that makes true distributed database systems possible. Transaction serialization is like Newtonian physics – all observers share a single universal reference frame. MVCC is like special relativity, where each observer views the universe from his or her reference frame. The views appear different but are, in fact, consistent.

Q4. Proprietary vs. open source software: what are the pros and cons?

Jim Starkey: It’s complicated. I’ve had feet in both camps for 15 years. But let’s draw a distinction between open source and open development. Open development – where anyone can contribute – is pretty good at delivering implementations of established technologies, but it’s very difficult to push the state of the art in that environment. Innovation, in my experience, requires focus, vision, and consistency that are hard to maintain in open development. If you have a controlled development environment, the question of open source versus propriety is tactics, not philosophy. Yes, there’s an argument that having the source available gives users guarantees they don’t get from proprietary software, but with something as complicated as a database, most users aren’t going to try to master the sources. But having source available lowers the perceived risk of new technologies, which is a big plus.

Q5. You led the Falcon project – a transactional storage engine for the MySQL server- through the acquisition of MySQL by Sun Microsystems. What impact did it have this project in the database space?

Jim Starkey: In all honesty, I’d have to say that Falcon’s most important contribution was its competition with InnoDB. In the end, that competition made InnoDB three times faster. Falcon, multi-version in memory using the disk for backfill, was interesting, but no matter how we cut it, it was limited by the performance of the machine it ran on. It was fast, but no single node database can be fast enough.

Q6. What are the most challenging issues in databases right now?

Jim Starkey: I think it’s time to step back and reexamine the assumptions that have accreted around database technology – data model, API, access language, data semantics, and implementation architectures. The “relational model”, for example, is based on what Codd called relations and we call tables, but otherwise have nothing to do with his mathematic model. That model, based on set theory, requires automatic duplicate elimination. To the best of my knowledge, nobody ever implemented Codd’s model, but we still have tables which bear a scary resemblance to decks of punch cards. Are they necessary? Or do they just get in the way?
Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size?. SQL has no provision for flexible text search, no provision for multi-column, multi-table search, and no mechanics in the APIs to handle the results if it could do them. And this is just one a dozen problems that SQL databases can’t handle. It was a really good technical fit for computers, memory, and disks of the 1980’s, but is it right answer now?

Q7. How do you see the database market evolving?

Jim Starkey: I’m afraid my crystal ball isn’t that good. Blobs, another of my creations, spread throughout the industry in two years. MVCC took 25 years to become ubiquitous. I have a good idea of where I think it should go, but little expectation of how or when it will.

Qx. Anything else you wish to add?

Jim Starkey: Let me say a few things about my current project, AmorphousDB, an implementation of the Amorphous Data Model (meaning, no data model at all). AmorphousDB is my modest effort to question everything database.
The best way to think about Amorphous is to envision a relational database and mentally erase the boxes around the tables so all records free float in the same space – including data and metadata. Then, if you’re uncomfortable, add back a “record type” attribute and associated syntactic sugar, so table-type semantics are available, but optional. Then abandon punch card data semantics and view all data as abstract and subject to search. Eliminate the fourteen different types of numbers and strings, leaving simply numbers and strings, but add useful types like URL’s, email addresses, and money. Index everything unless told not to. Finally, imagine an API that fits on a single sheet of paper (OK, 9 point font, both sides) and an implementation that can span hundreds of nodes. That’s AmorphousDB.

————
Jim Starkey invented the NuoDB Emergent Architecture, and developed the initial implementation of the product. He founded NuoDB [formerly NimbusDB] in 2008, and retired at the end of 2012, shortly before the NuoDB product launch.

Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history from the Datacomputer project on the fledgling ARPAnet to his most recent startup, NuoDB, Inc. Through the period, he has been
responsible for many database innovations from the date data type to the BLOB to multi-version concurrency control (MVCC). Starkey has extensive experience in proprietary and open source software.

Starkey joined Digital Equipment Corporation in 1975, where he created the Datatrieve family of products, the DEC Standard Relational Interface architecture, and the first of the Rdb products, Rdb/ELN. Starkey was also software architect for DEC’s database machine group.

Leaving DEC in 1984, Starkey founded Interbase Software to develop relational database software for the engineering workstation market. Interbase was a technical leader in the database industry producing the first commercial implementations of heterogeneous networking, blobs, triggers, two phase commit, database events, etc. Ashton-Tate acquired Interbase Software in 1991, and was, in turn, acquired by Borland International a few months later. The Interbase database engine was released open source by Borland in 2000 and became the basis for the Firebird open source database project.

In 2000, Starkey founded Netfrastructure, Inc., to build a unified platform for distributable, high quality Web applications. The Netfrastructure platform included a relational database engine, an integrated search engine, an integrated Java virtual machine, and a high performance page generator.

MySQL, AB, acquired Netfrastructure, Inc. in 2006 to be the kernel of a wholly owned transactional storage engine for the MySQL server, later known as Falcon. Starkey led the Falcon project through the acquisition of MySQL by Sun Microsystems.

Jim has a degree in Mathematics from the University of Wisconsin.
For amusement, Jim codes on weekends, while sailing, but not while flying his plane.

——————

Resources

– NuoDB Emergent Architecture (.PDF)

– On Database Resilience. Interview with Seth Proctor, ODBMs Industry Watch, March 17, 2015

– Hands-On with NuoDB and Docker, BY MJ Michaels, NuoDB. ODBMS.org– OCT 27 2015

– How leading Operational DBMSs rank popularity wise? By Michael Waclawiczek– ODBMS.org · JANUARY 27, 2016

– A Glimpse into U-SQL BY Stephen Dillon, Schneider Electric, ODBMS.org-DECEMBER 7, 2015

– Gartner Magic Quadrant for Operational DBMS 2015

Follow us on Twitter: @odbmsorg

Using NoSQL for Ireland’s Online Tax Research Database.

Roberto V. Zicari — Mon, 02 May 2016 08:18:17 +0000

“When the Institute began to look for a new platform, it became apparent that a relational database was not the best solution to effectively manage and deliver our XML content.”–Martin Lambe.

The Irish Tax Institute is the leading representative and educational body for Ireland’s AITI Chartered Tax Advisers (CTA) and is the only professional body exclusively dedicated to tax. One of their service is TaxFind – Ireland’s Leading Online Tax Research Database, offering Search to 200,000 pages of tax content, over 8,000 pages of Irish tax legislation, Irish Tax Institute tax technical papers, over 25 leading tax commentary publications, and 1000s of Irish Tax Review articles.

I did a joint interview with Martin Lambe, CEO of the Irish Tax Institute and Sam Herbert, Client Services Director at 67 Bricks.
Main topics of the interview are the data challenges they currently face, and the implementation of TaxFind using MarkLogic.

RVZ

Q1. What are the main data challenges you currently have at the Irish Tax Institute?

Martin Lambe: The Irish Tax Institute moved its publication workflow to an XML-based process in 2009 and we have a large archive of valuable tax information contained in quite complex XML format. The main challenge was to find a solution that could store the repository of data (XML and other formats) and provide a simple search interface that directs users very quickly to the most relevant result. The “findability” of relevant content is crucial.

Q2. What is the TaxFind research database?

Martin Lambe: The Irish Tax Institute is the main provider of tax information in Ireland and TaxFind is the Institute’s online tax research database. TaxFind offers subscribers access to Irish tax legislation and guidance that includes tax technical papers from seminars and conferences, as well as over 30 tax commentary publications. It is used by thousands of CTAs in Ireland on a daily basis to assist in their tax research.

Q3. Who are the members that benefit from this TaxFind research database?

Martin Lambe: TaxFind serves the Chartered Tax Adviser (CTA) community in Ireland and other tax professionals such as those in the global accounting firms.

Q4. Why did you discard your previous implementation with a relational database system?

Martin Lambe: The previous database was literally creaking at the seams. Users were increasingly frustrated with difficulties accessing the database on different browsers and the old platform did not support mobile devices or tablets. When the Institute began to look for a new platform, it became apparent that a relational database was not the best solution to effectively manage and deliver our XML content. XML content stored in a NoSQL document database is indexed specifically for the search engine and this means the performance of our search engine and the relevancy of results is dramatically improved.

Q5. Why did you select MarkLogic`s NoSQL database platform?

Sam Herbert: MarkLogic is scalable to support fast querying across large amounts of data, it deals with XML content very well (and most of the tax data is either in XML, or in HTML that can be treated as XHTML), and has good searching. It is also a good environment to develop in – it has excellent documentation, and good tooling. It helps that it uses XQuery as one of its query languages, rather than a proprietary database-specific language.

Q6. Is SQL still important for you?

Sam Herbert: I don’t think it’s true to say that any particular type of technology is “important” to ITI – it’s all about how it can benefit users. From a 67 Bricks perspective, we work with relational databases, NoSQL databases, and graph databases depending on what shape the data is and what the needs are around querying it.

Q7 Why not choose an open source solution?

Sam Herbert: We’re using Open Source components in other parts of the system, and we’re keen on using Open Source where possible. However, for the data store, there aren’t any Open Source alternatives that have the combination of good scalability, good support for XML content, a standard query language, and powerful searching that we were looking for.

Q8. Can you tell us a bit about the architecture of the new implementation of the TaxFind research database

Sam Herbert: There are three major components:

– a frontend display and service layer written using the Play framework
– the MarkLogic data store
– a semantic enrichment component using Semaphore SmartLogic and the ITI taxonomy

The Play component is what users interact with – both for human users coming to the web site, and automated use of the web services. The bulk of the data retrieval and manipulation is done via a set of XQuery functions defined within the MarkLogic store. When new data is uploaded, it is processed within the Play code, enriched using Semaphore SmartLogic, and then stored in MarkLogic.

Q9. How do you manage to integrate Irish Tax Institute`s tax data, bringing together in excess of 300,000 pages of tax content including archive material in Word, PDF, XML and HTML?

Sam Herbert: The most complex part of the data is the XML content. These are very large XML files representing legislation, books, and other tax materials, that are inter-related in complex ways, and with a lot of deeply nested hierarchy. An important part of managing the data was splitting these into appropriately sized fragments, and then identifying the linking between different files – for example a piece of legislation will refer to other legislation, and commentary will refer to that legislation, and a new piece of legislation may supersede an earlier piece.

The non-XML content is larger in volume, but each individual document is smaller and is structurally simpler. Managing this content was largely a matter of loading it in and letting it be indexed.

Q10. How do you capture and digitize information in various formats and make it searchable?

Sam Herbert: Making it searchable is straightforward – it’s making it searchable in ways that support the expectations of the users that’s much more difficult.

A good search experience requires both subject matter expertise and good automated tests.

The basic search is using MarkLogic’s full text search. The next step was to work with tax experts within and outside the ITI to identify appropriate facets within the content with which to group the results – based on a combination of what the user requirements were and what was supported by the data.

There were additional complexities around weighting the search results to make the “best” results come at the top in as many circumstances as possible – for example, weighting terms within headings, weighting more recent content, weighting content based on its category so legislation is more important than commentary, and weighting content higher based on its popularity. The semantic enrichment based on tax terms from the ITI taxonomy also enhances the searching.

Q11. How do you ensure that this solution is scalable?

Sam Herbert: The solution is deployed to a load-balanced cluster using Amazon Web Services. The Play frontend is purely stateless REST. This means that we can scale to support more users easily by spinning up more servers – and using AWS makes this easy. Overall, using AWS has been a big win for us, in terms of being able to get servers running easily, being able to increase and decrease things like their memory size easily, and the various ancillary services it provides like DNS and load balancing. By making sure we can scale to support additional data, we can use MarkLogic effectively.

————-

Martin Lambe is Chief Executive of the Irish Tax Institute. His previous role within the Institute was that of Director of Finance.

Sam Herbert is Client Services Director at 67 Bricks, a company that works with information owners (particularly publishers) who want to enrich their content to make it more structured, granular, flexible and reusable.
67 Bricks utilises its deep understanding of the content enrichment challenge to help publishers develop systems and capabilities to increase the value of their content. With expertise in XML, business analysis, semantic tagging and software development, 67 Bricks works closely with its clients to develop and implement content enrichment capabilities and enriched content digital products.

————-
Resources

– Irish Tax Institute

– TaxFind

– 67 Bricks

– MarkLogic

–Unthink: Moving Beyond the Constraints of Relational Databases. by Tom McGrath, MarkLogic. ODBMS.org March 14, 2016.

–MarkLogic Case Study: Royal Society of Chemistry.ODBMS.org

–On making information accessible. Interview with David Leeming. ODBMS Industry Watch, on July 30, 2014

Follow us on Twitter: @odbmsorg

A Grand Tour of Big Data. Interview with Alan Morrison

Roberto V. Zicari — Thu, 25 Feb 2016 15:52:44 +0000

“Leading enterprises have a firm grasp of the technology edge that’s relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.”–Alan Morrison.

I have interviewed Alan Morrison, senior research fellow at PwC, Center for Technology and Innovation.
Main topic of the interview is how the Big Data market is evolving.

RVZ

Q1. How do you see the Big Data market evolving?

Alan Morrison: We should note first of all how true Big Data and analytics methods emerged and what has been disruptive. Over the course of a decade, web companies have donated IP and millions of lines of code that serves as the foundation for what’s being built on top. In the process, they’ve built an open source culture that is currently driving most big data-related innovation. As you mentioned to me last year, Roberto, a lot of database innovation was the result of people outside the world of databases changing what they thought needed to be fixed, people who really weren’t versed in the database technologies to begin with.

Enterprises and the database and analytics systems vendors who serve them have to constantly adjust to the innovation that’s being pushed into the open source big data analytics pipeline. Open source machine learning is becoming the icing on top of that layer cake.

Q2. In your opinion what are the challenges of using Big Data technologies in the enterprise?

Alan Morrison: Traditional enterprise developers were thrown for a loop back in the late 2000s when it comes to open source software, and they’re still adjusting. The severity of the problem differs depending on the age of the enterprise. In our 2012 issue of the Forecast on DevOps, we made clear distinctions between three age classes of companies: legacy mainstream enterprises, pre-cloud enterprises and cloud natives. Legacy enterprises could have systems that are 50 years old or more still in place and have simply added to those. Pre-cloud enterprises are fighting with legacy that’s up to 20 years old. Cloud natives don’t have to fight legacy and can start from scratch with current tech.

DevOps (dev + ops) is an evolution of agile development that focuses on closer collaboration between developers and operations personnel. It’s a successor to agile development, a methodology that enables multiple daily updates to operational codebases and feedback-response loop tuning by making small code changes and see how those change user experience and behaviour. The linked article makes a distinction between legacy, pre-cloud and cloud native enterprises in terms of their inherent level of agility:

Most enterprises are in the legacy mainstream group, and the technology adoption challenges they face are the same regardless of the technology. To build feedback-response loops for a data-driven enterprise in a legacy environment is more complicated in older enterprises. But you can create guerilla teams to kickstart the innovation process.

Q3. Is the Hadoop ecosystem now ready for enterprise deployment at large scale?

Alan Morrison: Hadoop is ten years old at this point, and Yahoo, a very large mature enterprise, has been running Hadoop on 10,000 nodes for years now. Back in 2010, we profiled a legacy mainstream media company who was doing logfile analysis from all of its numerous web properties on a Hadoop cluster quite effectively. Hadoop is to the point where people in their dens and garages are putting it on Raspberry Pi systems. Lots of companies are storing data in or staging it from HDFS. HDFS is a given. MapReduce, on the other hand, has given way to Spark.

HDFS preserves files in their original format immutably, and that’s important. That innovation was crucial to data-driven application development a decade ago. But Hadoop isn’t the end state for distributed storage, and NoSQL databases aren’t either. It’s best to keep in mind that alternatives to Hadoop and its ecosystem are emerging.

I find it fascinating what folks like LinkedIn and Metamarkets are doing data architecture wise with the Kappa architecture–essentially a stream processing architecture that also works for batch analytics, a system where operational and analytical data are one and the same. That’s appropriate for fully online, all-digital businesses. You can use HDFS, S3, GlusterFS or some other file system along with a database such as Druid. On the transactional side of things, the nascent IPFS (the Interplanetary File System) anticipates both peer-to-peer and the use of blockchains in environments that are more and more distributed. Here’s a diagram we published last year that describes this evolution to date:

From PWC Technology Forecast 2015

People shouldn’t be focused on Hadoop, but what Hadoop has cleared a path for that comes next.

Q4. What are in your opinion the most innovative Big Data technologies?

Alan Morrison: The rise of immutable data stores (HDFS, Datomic, Couchbase and other comparable databases, as well as blockchains) was significant because it was an acknowledgement that data history and permanence matters, the technology is mature enough and the cost is low enough to eliminate the need to overwrite. These data stores also established that eliminating overwrites also eliminates a cause of contention. We’re moving toward native cloud and eventually the P2P fog (localized, more truly distributed computing) that will extend the footprint of the cloud for the Internet of things.

Unsupervised machine learning has made significant strides in the past year or two, and it has become possible to extract facts from unstructured data, building on the success of entity and relationship extraction. What this advance implies is the ability to put humans in feedback loops with machines, where they let machines discover the data models and facts and then tune or verify those data models and facts.

In other words, large enterprises now have the capability to build their own industry- and organization-specific knowledge graphs and begin to develop cognitive or intelligent apps on top those knowledge graphs, along the lines of what Cirrus Shakeri of Inventurist envisions.

From Cirrus Shakeri, “From Big Data to Intelligent Applications,” post, January 2015

At the core of computable semantic graphs (Shakeri’s term for knowledge graphs or computable knowledge bases) is logically consistent semantic metadata. A machine-assisted process can help with entity and relationship extraction and then also ontology generation.

Computability = machine readability. Semantic metadata–the kind of metadata cognitive computing apps use–can be generated with the help of a well-designed and updated ontology. More and more, these ontologies are uncovered in text rather than hand built, but again, there’s no substitute for humans in the loop. Think of the process of cognitive app development as a continual feedback-response loop process. The use of agents can facilitate the construction of these feedback loops.

Q5. In a recent note Carl Olofson, Research Vice President, Data Management Software Research, IDC, predicted the RIP of “Big Data” as a concept. What is your view on this?

Alan Morrison: I agree the term is nebulous and can be misleading, and we’ve had our fill of it. But that doesn’t mean it won’t continue to be used. Here’s how we defined it back in 2009:

Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used. (See https://www.pwc.com/us/en/technology-forecast/assets/pwc-tech-forecast-issue3-2010.pdf, pg. 6.)

For that issue of the Forecast, we focused on how Hadoop was being piloted in enterprises and the ecosystem that was developing around it. Hadoop was the primary disruptive technology, as well as NoSQL databases. It helps to consider the data challenge of the 2000s and how relational databases and enterprise data warehousing techniques were falling short at that point. Hadoop has reduced the cost of analyzing data by an order of magnitude and allows processing of very large unstructured datasets. NoSQL has made it possible to move away from rigid data models and standard ETL.

“Big Data” can continue to be shorthand for petabytes of unruly, less structured data. But why not talk about the system instead of just the data? I like the term that George Gilbert of Wikibon latched on to last year. I don’t know if he originated it, but he refers to the System of Intelligence. That term gets us beyond the legacy, pre-web “business intelligence” term, more into actionable knowledge outputs that go beyond traditional reporting and into the realm of big data, machine learning and more distributed systems. The Hadoop ecosystem, other distributed file systems, NoSQL databases and the new analytics capabilities that rely on them are really at the heart of a System of Intelligence.

Q6. How many enterprise IT systems do you think we will need to interoperate in the future?

Alan Morrison: I like Geoffrey Moore‘s observations about a System of Engagement that emerged after the System of Record, and just last year George Gilbert was adding to that taxonomy with a System of Intelligence. But you could add further to that with a System of Collection that we still need to build. Just to be consistent, the System of Collection articulates how the Internet of Things at scale would function on the input side. The System of Engagement would allow distribution of the outputs. For the outputs of the System of Collection to be useful, that system will need to interoperate in various ways with the other systems.

To summarize, there will actually be four enterprise IT systems that will need to interoperate, ultimately. Three of these exist, and one still needs to be created.

System of Collection: The Internet of Things ( (The Fog–yet to be created)–see Maher Abdelshkour, IoT, from Cloud to Fog Computing
System of Intelligence: big data, analytics, machine learning (The Cloud) –see George Gilbert on Systems of Intelligence: The Next Generation of Enterprise Applications built on Big Data
System of Engagement: social, mobile (The Cloud) See Geoffrey Moore,Systems of Engagement and the Future of Enterprise IT: A Sea Change in Enterprise IT
System of Record: ERP, CRM, SCM…. (The Core) Also described in Moore’s article above

The fuller picture will only emerge when this interoperation becomes possible.

Q7. What are the requirements, heritage and legacy of such systems?

Alan Morrison: The System of Record (RDBMSes) still relies on databases and tech with their roots in the pre-web era. I’m not saying these systems haven’t been substantially evolved and refined, but they do still reflect a centralized, pre-web mentality. Bitcoin and Blockchain make it clear that the future of Systems of Record won’t always be centralized. In fact, microtransaction flows in the Internet of Things at scale will depend on the decentralized approaches, algorithmic transaction validation, and immutable audit trail creation which blockchain inspires.

The Web is only an interim step in the distributed system evolution. P2P systems will eventually complemnt the web, but they’ll take a long time to kick in fully–well into the next decade. There’s always the S-curve of adoption that starts flat for years. P2P has ten years of an installed base of cloud tech, twenty years of web tech and fifty years plus of centralized computing to fight with. The bitcoin blockchain seems to have kicked P2P in gear finally, but progress will be slow through 2020.

The System of Engagement (requiring Web DBs) primarily relies on Web technnology (MySQL and NoSQL) in conjunction with traditional CRM and other customer-related structured databases.

The System of Intelligence (requiring Web file systems and less structured DBs) primarily relies on NoSQL, Hadoop, the Hadoop ecosystem and its successors, but is built around a core DW/DM RDBMS analytics environment with ETLed structured data from the System of Record and System of Engagement. The System of Intelligence will have to scale and evolve to accommodate input from the System of Collection.

The System of Collection (requiring distributed file systems and DBs) will rely on distributed file system successors to Hadoop and HTTP such as IPFS and the more distributed successors to MySQL+ NoSQL. Over the very long term, a peer-to-peer architecture will emerge that will become necessary to extend the footprint of the internet of things and allow it to scale.

Q8. Do you already have the piece parts to begin to build out a 2020+ intersystem vision now?

Alan Morrison: Contextual, ubiquitous computing is the vision of the 2020s, but to get to that, we need an intersystem approach. Without interoperation of the four systems I’ve alluded to, enterprises won’t be able to deliver the context required for competitive advantage. Without sufficient entity and relationship disambiguation via machine learning in machine/human feedback loops, enterprises won’t be able to deliver the relevance for competitive advantage.

We do have the piece parts to begin to build out an intersystem vision now. For example, interoperation is a primary stumbling block that can be overcome now. Middleware has been overly complex and inadequate to the current-day task, but middleware platforms such as EnterpriseWeb are emerging that can reach out as an integration fabric for all systems, up and down the stack. Here’s how the integration fabric becomes an essential enabler for the intersystem approach:

PwC, 2015

A lot of what EnterpriseWeb (full disclosure: a JBR partner of PwC) does hinges on the creation and use of agents and semantic metadata that enable the data/logic virtualization. That’s what makes the desiloing possible. One of the things about the EnterpriseWeb platform is that it’s a full stack virtual integration and application platform, using methods that have data layer granularity, but process layer impact. Enterprise architects can tune their models and update operational processes at the same time. The result: every change is model-driven and near real-time. Stacks can all be simplified down to uniform, virtualized composable entities using enabling technologies that work at the data layer. Here’s how they work:

PwC, 2015

So basically you can do process refinement across these systems, and intersystem analytics views thus also become possible.

Qx anything else you wish to add?

Alan Morrison: We always quote science fiction writer William Gibson, who said,

“The future is already here — it’s just not very evenly distributed.”

Enterprises would do best to remind themselves what’s possible now and start working with it. You’ve got to grab onto that technology edge and let it pull you forward. If you don’t understand what’s possible, most relevant to your future business success and how to use it, you’ll never make progress and you’ll always be reacting to crises. Leading enterprises have a firm grasp of the technology edge that’s relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.

We do a ton of research to get to the big picture and find the real edge, where tech could actually have a major business impact. And we try to think about what the business impact will be, rather than just thinking about the tech. Most folks who are down in the trenches are dismissive of the big picture, but the fact is they aren’t seeing enough of the horizon to make an informed judgement. They are trying to use tools they’re familiar with to address problems the tools weren’t designed for. Alongside them should be some informed contrarians and innovators to provide balance and get to a happy medium.

That’s how you counter groupthink in an enterprise. Executives need to clear a path for innovation and foster a healthy, forward-looking, positive and tolerant mentality. If the workforce is cynical, that’s an indication that they lack a sense of purpose or are facing systemic or organizational problems they can’t overcome on their own.

—————–
Alan Morrison (@AlanMorrison) is a senior research fellow at PwC, a longtime technology trends analyst and an issue editor of the firm’s Technology Forecast

Resources

Data-driven payments. How financial institutions can win in a networked economy, BY, Mark Flamme, Partner; Kevin Grieve, Partner; Mike Horvath, Principal Strategy&. FEBRUARY 4, 2016, ODBMS.org

The rise of immutable data stores, By Alan Morrison, Senior Manager, PwC Center for technology and innovation (CTI), OCTOBER 9, 2015, ODBMS.org

The enterprise data lake: Better integration and deeper analytics, By Brian Stein and Alan Morrison, PwC, AUGUST 20, 2014 ODBMS.org

Related Posts

On the Industrial Internet of Things. Interview with Leon Guzenda , ODBMS Industry Watch, January 28, 2016

On Big Data and Society. Interview with Viktor Mayer-Schönberger , ODBMS Industry Watch, January 8, 2016

On Big Data Analytics. Interview with Shilpa Lawande , ODBMS Industry Watch, December 10, 2015

On Dark Data. Interview with Gideon Goldin , ODBMS Industry Watch, November 16, 2015

Follow us on Twitter: @odbmsorg

Data for the Common Good. Interview with Andrea Powell

Roberto V. Zicari — Tue, 09 Jun 2015 10:55:08 +0000

“CABI has a proud history (we were founded in 1910) of serving the needs of agricultural researchers around the world, and it is fascinating to see how technology can now help to achieve our development mission. We can have much greater impact at scale these days on the lives of poor farmers around the world (on whom we are all dependent for our food) by using modern technology and by putting knowledge into the hands of those who need it the most.”–Andrea Powell

I have interviewed Andrea Powell,Chief Information Officer at CABI.
Main topic of the interview is how to use data and knowledge for the Common Good, specifically by solving problems in agriculture and the environment.

RVZ

Q1. What is the main mission of CABI?

Andrea Powell: CABI’s mission is to improve people’s lives and livelihoods by solving problems in agriculture and the environment.
CABI is a not-for-profit, intergovernmental organisation with over 500 staff based in 17 offices around the world. We focus primarily on plant health issues, helping smallholder farmers to lose less of what they grow and therefore to increase their yields and their incomes.

Q2. How effective is scientific publishing in helping the developing world solving agricultural problems?

Andrea Powell: Our role is to bridge the gap between research and practice.
Traditional scientific journals serve a number of purposes in the scholarly communication landscape, but they are often inaccessible or inappropriate for solving the problems of farmers in the developing world. While there are many excellent initiatives which provide free or very low-cost access to the research literature in these countries, what is often more effective is working with local partners to develop and implement local solutions which draw on and build upon that body of research.
Publishers have pioneered innovative uses of technology, such as mobile phones, to ensure that the right information is delivered to the right person in the right format.
This can only be done if the underlying information is properly categorised, indexed and stored, something that publishers have done for many decades, if not centuries. Increasingly we are able to extract extra value from original research content by text and data mining and by adding extra semantic concepts so that we can solve specific problems.

Q3. What are the typical real-world problems that you are trying to solve? Could you give us some examples of your donor-funded development programs?

Andrea Powell: In our Plantwise programme, we are working hard to reduce the crop losses that happen due to the effects of plant pests and diseases. Farmers can typically lose up to 40% of their crop in this way, so achieving just a 1% reduction in such losses could feed 25 million more hungry mouths around the world. Another initiative, called mNutrition, aims to deliver practical advice to farming families in the developing world about how to grow more nutritionally valuable crops, and is aimed at reducing child malnutrition and stunting.

Q4. How do you measure your impact and success?

Andrea Powell: We have a strong focus on Monitoring and Evaluation, and for each of our projects we include a “Theory of Change” which allows us to measure and monitor the impact of the work we are doing. In some cases, our donors carry out their own assessments of our projects and require us to demonstrate value for money in measurable ways.

Q5. What are the main challenges you are currently facing for ensuring CABI’s products and services are fit for purpose in the digital age?

Andrea Powell: The challenges vary considerably depending on the type of customer or beneficiary.
In our developed world markets, we already generate some 90% of our income from digital products, so the challenge there is keeping our products and platforms up-to-date and in tune with the way modern researchers and practitioners interact with digital content. In the developing world, the focus is much more on the use of mobile phone technology, so transforming our content into a format that makes it easy and cheap to deliver via this medium is a key challenge. Often this can take the form of a simple text message which needs to be translated into multiple languages and made highly relevant for the recipient.

Q6. You have one of the world’s largest agricultural database that sits in a RDBMS, and you also have info silos around the company. How do you pull all of these information together?

Andrea Powell: At the moment, with some difficulty! We do use APIs to enable us to consume content from a variety of sources in a single product and to render that content to our customers using a highly flexible Web Content Management System. However, we are in the process of transforming our current technology stack and replacing some of our Relational Databases with MarkLogic, to give us more flexibility and scaleability. We are very excited about the potential this new approach offers.

Q7. How do you represent and model all of this knowledge? Could you give us an idea of how the data management part for your company is designed and implemented?

Andrea Powell: We have a highly structured taxonomy that enables us to classify and categorise all of our information in a consistent and meaningful way, and we have recently implemented a semantic enrichment toolkit, TEMIS Luxid® to make this process even more efficient and automated. We are also planning to build a Knowledge Graph based on linked open data, which will allow us to define our domain even more richly and link our information assets (and those of other content producers) by defining the relationships between different concepts.

Q8. What kind of predictive analytics do you use or plan to use?

Andrea Powell: We are very excited by the prospect of being able to do predictive analysis on the spread of particular crop diseases or on the impact of invasive species. We have had some early investigations into how we can use semantics to achieve this; e.g. if pest A attacks crop B in country C, what is the likelihood of it attacking crop D in country E which has the same climate and soil types as country C?

Q9. How do you intend to implement such predictive analytics?

Andrea Powell: We plan to deploy a combination of expert subject knowledge, data mining techniques and clever programming!

Q10. What are future strategic developments?

Andrea Powell: Increasingly we are developing knowledge-based solutions that focus on solving specific problems and on fitting into user workflows, rather than creating large databases of content with no added analysis or insight. Mobile will become the primary delivery channel and we will also be seeking to use mobile technology to gather user data for further analysis and product development.

Qx Anything else you wish to add?

Andrea Powell: CABI has a proud history (we were founded in 1910) of serving the needs of agricultural researchers around the world, and it is fascinating to see how technology can now help to achieve our development mission. We can have much greater impact at scale these days on the lives of poor farmers around the world (on whom we are all dependent for our food) by using modern technology and by putting knowledge into the hands of those who need it the most.

————–
ANDREA POWELL,Chief Information Officer, CABI, United Kingdom.
I am a linguist by training (French and Russian) with an MA from Cambridge University but have worked in the information industry since graduating in 1988. After two and a half years with Reuters I joined CABI in the Marketing Department in 1991 and have worked here ever since. Since January 2015 I have held the position of Chief Information Officer, leading an integrated team of content specialists and technologists to ensure that all CABI’s digital and print publications are produced on time and to the quality standards expected by our customer worldwide. I am responsible for future strategic development, for overseeing the development of our technical infrastructure and data architecture, and for ensuring that appropriate information & communication technologies are implemented in support of CABI’s agricultural development programmes around the world.

Resources

– More information about how CABI is using MarkLogic can be found in this video, recorded at MarkLogic World San Francisco, April 2015.

Follow ODBMS.org on Twitter: @odbmsorg

Big Data and the financial services industry. Interview with Simon Garland

Roberto V. Zicari — Tue, 02 Jun 2015 07:56:43 +0000

“The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.”–Simon Garland

The topic of my interview with Simon Garland, Chief Strategist at Kx Systems, is Big Data and the financial services industry.

RVZ

Q1. Talking about the financial services industry, what types of data and what quantities are common?

Simon Garland: The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.

The data comes in through feed-handlers as streaming data. It is stored in-memory throughout the day and is appended to the on-disk historical database at the day’s end. Algorithmic trading decisions are made on a millisecond basis using this data. The associated risks are evaluated in real-time based on analytics that draw on intraday data that resides in-memory and historical data that resides on disk.

Q2. What are the most difficult data management requirements for high performance financial trading and risk management applications?

Simon Garland: There has been a decade-long arms race on Wall Street to achieve trading speeds that get faster every year. Global financial institutions in particular have spent heavily on high performance software products, as well as IT personnel and infrastructure just to stay competitive. Traders require accuracy, stability and security at the same time that they want to run lightning fast algorithms that draw on terabytes of historical data.

Traditional databases cannot perform at these levels. Column store databases are generally recognized to be orders of magnitude faster than regular RDBMS; and a time-series optimized columnar database is uniquely suited for delivering the performance and flexibility required by Wall Street.

Q3. And why is this important for businesses?

Simon Garland: Orders of magnitude improvements in performance will open up new possibilities for “what-if” style analytics and visualization; speeding up their pace of innovation, their awareness of real-time risks and their responsiveness to their customers.

The Internet of Things in particular is important to businesses who can now capitalize on the digitized time-series data they collect, like from smart meters and smart grids. In fact, I believe that this is only the beginning of the data volumes we will have to be handling in the years to come. We will be able to combine this information with valuable data that businesses have been collecting for decades.

Q4. One of the promise of Big Data for many businesses is the ability to effectively use both streaming data and the vast amounts of historical data that will accumulate over the years, as well as the data a business may already have warehoused, but never has been able to use. What are the main challenges and the opportunities here?

Simon Garland: This can seem like a challenge for people trying to put a system together from a streaming database; an in-memory database from a different vendor, and an historical database from yet another vendor. They then pull data from all of these applications into yet another programming environment. This method cannot give performance and long term is fragile and unmaintainable.

The opportunity here is for a database platform that unifies the software stack, like kdb+, that is robust, easily scalable and easily maintainable.

Q5. How difficult is to combine and process streaming, in-memory and historical data in real time analytics at scale?

Simon Garland: This is an important question. These functionalities can’t be added afterwards. Kdb+ was designed for streaming data, in-memory data and historical data from the beginning. It was also designed with multi-core and multi-process support from the beginning which is essential for processing large amounts of historical data in parallel on current hardware.

We were doing this for decades, even before multi-core machines existed — which is why Wall Street was an early adopter of our technology.

Q6. q programming language vs. SQL: could you please explain the main differences? And also highlight the Pros and cons of each.

Simon Garland: The q programming language is built into the database system kdb+. It is an array programming language that inherently supports the concepts of vectors and column store databases rather than the rows and records that traditional SQL supports.

The main difference is that traditional SQL doesn’t have a concept of order built in, whereas the q programming language does. Unlike traditional SQL, the language q contains a concept of order. This makes complete sense when dealing with time-series data.

Q is intuitive and the syntax is extremely concise, which leads to more productivity, less maintenance and quicker turn-around time.

Q7. Could you give us some examples of successful Big Data real time analytics projects you have been working on?

Simon Garland: Utility applications are using kdb+ for millisecond queries of tables with hundreds of billions of data points captured from millions of smart meters. Analytics on this data can be used for balancing power generation, managing blackouts and for billing and maintenance.

Internet companies with massive amounts of traffic are using kdb+ to analyze Googlebot behavior to learn how to modify pages to improve their ranking. They tell us that traditional databases simply won’t work when they have 100 million pages receiving hundreds of millions of hits per day.

In industries like pharmaceuticals, where decision-making is based on data that can be one day, one week or one month old, our customers and prospects say our column store database makes their legacy data warehouse software obsolete. It is many times faster on the same queries. The time needed for complex analyses on extremely large tables has literally been reduced from hours to seconds.

Q8. Are there any similarities in the way large data sets are used in different vertical markets such as financial service, energy & pharmaceuticals?

Simon Garland: The shared feature is that all of our customers have structured, time-series data. The scale of their data problems are completely different, as are their business use cases. The financial services industry, where kdb+ is an industry standard, demands constant improvements to real-time analytics.

Other industries, like pharma, telecom, oil and gas and utilities, have a different concept of time. They also often are working with smaller data extracts, which they often still consider “Big Data.” When data comes in one day, one week or one month after an event occurred, there is not the same sense of real-time decision making as in finance. Having faster results for complex analytics helps all industries innovate and become more responsive to their customers.

Q9. Anything else you wish to add?

Simon Garland: If we piqued your interest, we have a free, 32-bit version of kdb+ available for download on our web site.

————-
Simon Garland, Chief Strategist, Kx Systems
Simon is responsible for upholding Kx’s high standards for technical excellence and customer responsiveness. He also manages Kx’s participation in the Securities Trading Analysis Center, overseeing all third-party benchmarking.
Prior to joining Kx in 2002, Simon worked at a database search engine company.
Before that he worked at Credit Suisse in risk management. Simon has developed software using kdb+ and q, going back to when the original k and kdb were introduced. Simon received his degree in Mathematics from the University of London and is currently based in Europe.

Resources

– LINK to Download of the free 32-bit version of kdb+

– Q Tips: Fast, Scalable and Maintainable Kdb+, Author: Nick Psaris

–On Big Data and the Internet of Things. Interview with Bill Franks. Source: ODBMS Industry Watch, Published on 2015-03-09

–On MarkLogic 8. Interview with Stephen Buxton. Source: ODBMS Industry Watch, Published on 2015-02-13

Follow ODBMS.org on Twittwer: @odbmsorg
##