ODBMS Industry Watch » NoSQL. Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation. http://www.odbms.org/blog

On Digital labor: Technology, Challenges and Opportunities. Interview with Michael Henry
http://www.odbms.org/blog/2017/02/on-digital-labor-technology-challenges-and-opportunities-interview-with-michael-henry/
Mon, 27 Feb 2017 08:51:48 +0000

“Digital labor is the name for a new class of tools that can automate routine cognitive tasks. The benefits of automation are similar to previous waves. Many years ago I helped automate a reconciliation function for a large asset manager. Humans took authorization reports from their investment control system and matched them against the confirmations coming from their counterparts. This was a terrible job, and luckily no one does this anymore.
Digital labor has the potential to improve the financial services sector by improving compliance, providing more analytics for risk and control functions, and improving efficiency.”–Michael Henry

I have interviewed Michael Henry, Principal at KPMG LLP. In the interview we covered the challenges faced by financial institutions due to existing regulatory standards, KPMG's solution to automate the onboarding process for its clients, and the potential impact of digital labor on the financial services sector.

RVZ

Q1. The Organisation for Economic Co-operation and Development (OECD) proposed a Common Reporting Standard (CRS) for the Automatic Exchange of Information (AEOI) that implies a significant increase in the customer due diligence and reporting obligations of financial institutions across the world. What is the implication for your clients?

Michael Henry: The new reporting requirement will require financial institutions to collect and examine more information about their clients for the purposes of tax withholding and reporting. Banks and other regulated institutions will have to examine information from their clients to make sure they are reporting their true residence for tax purposes. This is similar to the US Internal Revenue Service’s FATCA requirements. And like FATCA, many banks will respond by asking for more documentation from their clients and adding staff to perform due diligence on that documentation.

Q2. Specifically, what is “client onboarding”? How is it normally implemented by large financial institutions?

Michael Henry: Client onboarding refers to the series of processes a financial institution undergoes to determine whether or not it should move forward with conducting or renewing business with a given customer.
The term is inclusive of the underlying regulatory and compliance practices governed by anti-money laundering (AML) and know-your-customer (KYC) rules.
Many large financial institutions deploy thousands of staff, often in low-cost offshore locations, to perform this function. These staff are usually equipped with basic workflow and data management technology. At Tier 1 organizations this can cost hundreds of millions of dollars annually while pinning their reputations on the shoulders of junior resources making subjective compliance policy interpretations.
For this basic client identification and validation process, one of our clients employs thousands of people in an offshore location. Because this work is boring and repetitive, the client tells us that the attrition rate is more than 10% per month. This presents an enormous risk to the business, as banks entrust their client experience, business results, and reputations to cheap clerical labor that likely joined the bank only a few months ago.

Q3. What are the typical problems?

Michael Henry: The bank must collect information to identify the client and determine the risk that the client will engage in some kind of unlawful activity. To perform this function, the bank must process a large amount of data that enters the bank electronically or through documents. Reading and interpreting documents and trying to apply complex compliance rules using manual processes is time-consuming, error-prone, and expensive.
Technology – Workflow, case management, relational databases, and imaging technologies while mature and effective, still require human beings to read, transcribe, and interpret data.
Inconsistency – Human operators interpret complex decision-trees of rules. The risk of subjectivity grows with the size of the operation.
Accuracy – The majority of today’s onboarding representatives execute what amount to “stare and compare” and “stare, copy and enter” processes. Over the course of a business day in which hundreds of pages or documents will be read and thousands of keystrokes completed, it is inevitable that operator errors will occur.

Q4. You have worked on a solution as a service to automate the onboarding process for your clients. Can you explain in a nutshell how you did it?

Michael Henry: The solution comprises multiple digital labor components that read documents and apply policy rules by machine instead of by people.
Humans focus on exceptions, i.e., cases which really require human judgment. Because the exception rates are low, much of the activity becomes straight-through.
The technology uses a combination of robotics, big data, and natural language processing integrated for the solution of KYC, AML, Tax classification, and other compliance activities.
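
To make the division of labor concrete, here is a minimal, purely hypothetical sketch of straight-through processing with exception routing. The rules, field names, and thresholds below are invented for illustration and are not KPMG's actual policy logic.

```python
# Illustrative sketch of straight-through processing with exception routing.
# The rules, field names and thresholds are hypothetical, not KPMG's policy set.

def apply_policy_rules(case):
    """Return a list of rule violations; an empty list means straight-through."""
    exceptions = []
    if case.get("country_of_residence") not in case.get("declared_tax_residences", []):
        exceptions.append("Residence mismatch between declaration and KYC documents")
    if case.get("document_quality", 1.0) < 0.8:   # e.g. poor OCR confidence
        exceptions.append("Documentation quality below threshold")
    if case.get("risk_score", 0) > 70:
        exceptions.append("High risk score requires enhanced due diligence")
    return exceptions

def onboard(cases):
    straight_through, needs_review = [], []
    for case in cases:
        problems = apply_policy_rules(case)
        (needs_review if problems else straight_through).append((case["client_id"], problems))
    return straight_through, needs_review

cases = [
    {"client_id": "C-001", "country_of_residence": "IE",
     "declared_tax_residences": ["IE"], "document_quality": 0.95, "risk_score": 12},
    {"client_id": "C-002", "country_of_residence": "US",
     "declared_tax_residences": ["KY"], "document_quality": 0.91, "risk_score": 55},
]
done, review = onboard(cases)
print(f"{len(done)} straight-through, {len(review)} routed to a human exception handler")
```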

Q5. How difficult was it to integrate domain knowledge into advanced technology?

Michael Henry: Domain knowledge is critical. KPMG invested significant regulatory and compliance expertise to reinvent this process for ourselves and our clients. The technology only works because of this investment.
We use advanced technology, but it is all commercially available. Our ability to define specific ontologies and compliance rules on that technology is the differentiator.

Q6. How do you capture information from SEC filings, blog entries, social media, text messages and other sources of structured and unstructured data without manual intervention?

Michael Henry: We capture information from structured and unstructured sources through a combination of technologies. Optical character recognition (OCR) and natural language processing (NLP) software drive our content enrichment process. This allows our platform to ingest unstructured documents (with or without metadata), identify them, and then extract the relevant content according to our ontological models. Some exception processing occurs at this stage, especially if the quality of the documentation is poor.
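
As an illustration of the kind of pipeline described here, the sketch below uses the open-source pytesseract and spaCy libraries as stand-ins for the (unnamed) commercial OCR and NLP components; the entity types extracted are an assumption, not a description of KPMG's actual models.

```python
# Illustrative only: pytesseract and spaCy stand in for the commercial OCR and
# NLP software mentioned in the interview.
import pytesseract              # pip install pytesseract (requires the Tesseract binary)
import spacy                    # pip install spacy && python -m spacy download en_core_web_sm
from PIL import Image

nlp = spacy.load("en_core_web_sm")

def extract_entities(scanned_page_path):
    text = pytesseract.image_to_string(Image.open(scanned_page_path))  # OCR step
    doc = nlp(text)                                                    # NLP step
    # Pull out people, organisations and locations as candidate KYC data points
    return [(ent.text, ent.label_) for ent in doc.ents
            if ent.label_ in {"PERSON", "ORG", "GPE"}]

# extract_entities("passport_scan.png") might return [("Jane Doe", "PERSON"), ("Dublin", "GPE")]
```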

Q7. How do you integrate, organize and mine customer data?

Michael Henry: Customer data are ingested to the platform through system extracts, tying in to document repositories and the establishment of secure FTP sites. These data then pass through our content enrichment engine and ultimately reside in our MarkLogic NoSQL database.

Q8. Why did you choose MarkLogic’s Enterprise NoSQL database?

Michael Henry: First, we are solving mission-critical subjects for the world’s leading financial institutions. We needed to have an institutional-grade, enterprise-hardened database at the core of our platform.
Second, given the size of the data sets involved, we needed a highly scalable database that could handle petabytes of data while simultaneously staging and orchestrating multiple run-time sequences. Finally, we found MarkLogic very aligned with our vision and a good partner in bringing the solution to market.

Q9. How do you use semantics, text analytics and visualisation?

Michael Henry: Semantic analysis allows us to handle unstructured data in natural language formats. Extracting the list of beneficial owners from a 100-page trust document can take a human hours. The tools are so proficient now that, with the right ontological models, we can obtain dozens of data points from an unstructured document at high volume with little human intervention. We have been able to ingest hundreds of individual loan documents and produce a data hierarchy by client, by loan, and by event.

Q10. What results did you obtain so far? What is the order of magnitude reduction in human efforts you obtained? As human involvement in the process declines, is the number of errors in reports also declining?

Michael Henry: Today, we serve more than 20 clients. In the tax compliance area, a human may spend more than an hour ingesting a W8 form and conducting due diligence. Most of this is reading KYC documents. Our platform has the ability to handle more than 10 of these per hour per human exception handler. If the task involves humans reading documents and applying validation or other policies, and the rate of actual exceptions is low, we can take 80-90% of the manual effort out. And the tools keep getting better.
More important than the productivity gain is the consistency and accuracy of the automation. No human operator can apply thousands of policy rules consistently. We continue to tune our models, and the machine never forgets.

Q11. In your opinion, what is the impact of the introduction of “Digital Labor” services on the job market and on society at large?

Michael Henry: Digital labor is the name for a new class of tools that can automate routine cognitive tasks. The benefits of automation are similar to previous waves. Many years ago I helped automate a reconciliation function for a large asset manager. Humans took authorization reports from their investment control system and matched them against the confirmations coming from their counterparts. This was a terrible job, and luckily no one does this anymore.
Digital labor has the potential to improve the financial services sector by improving compliance, providing more analytics for risk and control functions, and improving efficiency.

************************************************************

Michael Henry, Principal, Financial Services, KPMG LLP
Michael is a Principal in KPMG’s Digital Labor practice with more than 25 years’ experience in financial services. Michael specializes in the application of sophisticated technologies (big data, natural language processing, artificial intelligence, machine learning, workflow and robotics) to automate compliance processes. Michael has worked with global and regional banks, and his experience includes living and working in Europe and Asia.

Resources

– FATCA Onboarding & Compliance Solution. KPMG, 2015 (LINK to .PDF)

Related Posts

High-performance Compliance Capture and Analytics Solution for Financial Institutions. Interview with Michael Hay and Oskar Mencer. ODBMS Industry Watch, Published on 2017-01-26

On fraud detection, Medicaid, and the insurance industry. Interview with Charles Kaminski Jr. ODBMS Industry Watch, Published on 2016-11-01

Follow us on Twitter: @odbmsorg

##

On in-memory, key-value data stores. Ofer Bengal and Yiftach Shoolman
http://www.odbms.org/blog/2017/02/on-in-memory-key-value-data-stores-ofer-bengal-and-yiftach-shoolman/
Mon, 13 Feb 2017 10:52:57 +0000

“While modernizing legacy applications used to be a key reason for deploying in-memory, key-value data stores, we see that this is changing. New applications, particularly those that are highly interactive, need to bring a user experience that is very responsive under all conditions. For such new applications, an in-memory datastore, particularly one that can simplify run time analytics like counting, scoring, managing lists and sets, is becoming a key ingredient for low latency responses and high throughput.” –Ofer Bengal.

I have interviewed Ofer Bengal, Co-Founder and CEO of Redis Labs, and Yiftach Shoolman, Co-Founder and CTO of Redis Labs.
The main topics of the interview are: how the database market is evolving, proprietary vs. open source software, in-memory/key-value data stores, and the new features of Redis.

RVZ

Q1. How do you see the database market evolving?

Ofer Bengal, Yiftach Shoolman: The main trends we identify today and believe will continue in upcoming years are:
1) Non-relational databases will continue to see growing adoption, because the schema framework is ineffective when it comes to unstructured data, change in data patterns, growing data volumes, more stringent performance requirements and the way modern apps are built.
2) Multiple database models, as opposed to the absolute dominance of the RDBMS in the past few decades, with each model solving the requirements of certain use cases.
Moreover, certain modern databases can run several database models (document, graph, etc.)
3) Multiple databases (different types or the same type) serving the same app. Modern applications are based on micro service architecture, in which each micro service works with the best database for its use case.
This creates new challenges for modern databases: (a) instant provisioning – sometimes hundreds or thousands of databases are provisioned within a second, and (b) multi-tenancy – otherwise the cost associated with managing database infrastructure becomes extremely high.
4) Database-as-a-service is growing vs. self-deployed and self-operated databases. With enterprises gradually moving to the cloud and having to deal with multiple types of databases, it makes a lot of sense to outsource deployment and ongoing operations rather than building an in-house practice of DBAs and DevOps.
5) Hybrid transactional and analytical processing (HTAP). Driven by the need for application analytics to drive business decision making in real time, certain modern databases can handle those two different workloads simultaneously, eliminating the need for exporting transactional data to a separate dedicated analytical database.

Q2. Proprietary vs. open source software: what are the pros and cons?

Ofer Bengal, Yiftach Shoolman: From the community perspective, open source is great. If there is a vibrant community, it pushes innovation, problem solving and compatibility issues with different environments.
From the user's perspective, open source is “open”, accessible, can be used by anyone, transparent, and free of charge.
It often comes with less of a danger of vendor lock-in. It is very suitable for independent developers and startups. However enterprises using open source products may have certain challenges:
1. The product is not always suitable for enterprise workloads, especially when it comes to databases. Capabilities like infinite seamless scaling, high-availability with instant failover and stable performance at scale are not always the open source developer’s top priority.
2. Commercial support must be obtained and this typically comes with a price tag which is not much different than acquiring a commercial database product.
3. Commercial support is typically provided by a single company (most probably founded by the open source creators), which creates “vendor lock-in” by itself.
4. In the case of databases, using database-as-a-service may turn out to be lower in cost than provisioning cloud instances and running zero-cost open source software on them, because commercial offerings can be based on an efficient multi-tenant architecture.

Q3. What is the current market for in-memory, key-value data stores?

Ofer Bengal: In-memory key-value data stores (sometimes called in-memory data grids, or IMDGs) have been around for more than a decade and have proven capable of supporting digital business needs for a responsive, always-on user experience; real-time, actionable insights; and dynamic scaling. They are widely employed when you want to scale or modernize legacy applications without spending additional money on extremely expensive RDBMS licenses and hardware. This is achieved by providing a scalable and reliable in-memory datastore that enables low-latency transactional and analytical processing.
While modernizing legacy applications used to be a key reason for deploying in-memory, key-value data stores, we see that this is changing. New applications, particularly those that are highly interactive, need to bring a user experience that is very responsive under all conditions. For such new applications, an in-memory datastore, particularly one that can simplify run time analytics like counting, scoring, managing lists and sets, is becoming a key ingredient for low latency responses and high throughput.

From a Redis perspective, our innovation in data structures brings about the ability to simplify development to the extent that now most Redis users use it as a first responder and primary datastore for substantial pieces of their data. Furthermore with Redis’ data-structures, users can run operational and analytical use cases on the same database.
In addition, acceleration of other in-memory platforms like Spark is possible with Redis.
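
As a concrete illustration of the “counting, scoring, managing lists and sets” operations mentioned above, here is a minimal redis-py sketch; the key names and values are invented for the example.

```python
# Minimal redis-py sketch of the run-time analytics mentioned above:
# counting, scoring, and managing lists and sets. Key names are invented.
import redis

r = redis.Redis(host="localhost", port=6379)

r.incr("page:home:views")                      # counting
r.zincrby("leaderboard", 50, "player:42")      # scoring in a sorted set
r.lpush("recent:orders", "order:1001")         # managing a list (most recent first)
r.ltrim("recent:orders", 0, 99)                # keep only the last 100 entries
r.sadd("active:users:2017-02-13", "user:7")    # managing a set

print(r.zrevrange("leaderboard", 0, 9, withscores=True))   # top-10 scores
print(r.scard("active:users:2017-02-13"))                  # distinct active users
```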

Gartner estimates that, in 2015, the stand-alone IMDG market was worth approximately $600 million, having grown by about 30% from the previous year. Gartner expects the market to continue to grow in the double-digit range through 2020 and to exceed $1 billion by 2018. Redis, one of the leaders in this space, grew in just a few years to be one of the most popular databases used by developers and enterprises.

Q4. Amazon ElastiCache supports two open-source in-memory engines: Redis and Memcached. What does it mean in practice?

Yiftach Shoolman: In practice, Amazon ElastiCache is a simple caching service that simplifies the developer experience by providing these two open source in-memory engines. Legacy applications that use a simple cache can use ElastiCache seamlessly.
However, ElastiCache is single-tenant, limited to caching use cases and cannot be used as a database, lacking enterprise-grade functionalities such as infinite seamless scalability, instant failover and predictable performance.
The Redis Labs equivalent service, called Redis Cloud provides all the benefits of an enterprise-class Redis.

Q5. What are the pros and cons of Memcached and Redis?

Yiftach Shoolman: Redis can be thought of as a modern database, while memcached is an older technology designed specifically for ephemeral caching.
The most important difference is in persistence and HA – memcached is neither persistent nor highly available, while Redis can operate as a full-fledged in-memory database, highly available through both in-memory replication and data persistence. This reflects the fact that caches in older architectures were not required to be highly available, but in modern architectures, built for scale and volume, cache outages can significantly impact the business and user experience.
Redis, the newer and more versatile technology, allows individual data elements to be manipulated, while memcached often incurs serialization/deserialization overheads that make the entire application processing much slower. This is because memcached can handle only simple key-value use cases, whereas Redis offers many more data structures (hashes, sets, sorted sets, lists, HyperLogLog and so on) that simplify complex data processing, analysis and operational use cases with ease.
Even when used as a cache, Redis has more sophisticated eviction policies which can be both active or passive while memcached has only a simple LRU and lazy eviction.
Redis and Memcached are both very popular open source projects, but given its richer functionality, more advanced design, many potential uses, and greater cost efficiency at scale, Redis should be your first choice in nearly every case.
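
As an illustration of the eviction-policy point above, the sketch below shows how a Redis memory cap and eviction policy can be set at run time via redis-py; the memory limit chosen is arbitrary.

```python
# Example of selecting a Redis eviction policy at run time (values are illustrative).
import redis

r = redis.Redis(host="localhost", port=6379)
r.config_set("maxmemory", "256mb")               # cap memory for cache usage
r.config_set("maxmemory-policy", "allkeys-lru")  # evict least-recently-used keys
# Other policies include volatile-lru, allkeys-random, volatile-ttl and noeviction;
# memcached, by contrast, offers only its built-in LRU behaviour.
print(r.config_get("maxmemory-policy"))
```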

Q6. For very large data sets or analytics workloads, running everything in-memory might not be cost effective. What is your take on this?

Ofer Bengal, Yiftach Shoolman: For very large data sets or analytics workloads, it is advantageous to utilize alternative memory technologies (such as Flash memory, which is a tenth of the cost) as extensions of memory rather than impose a disk access penalty. We have extended enterprise Redis in this manner to take advantage of Flash memory, while using a tiered approach (keys and hot values are still in the fastest memory, while cold values are in “slower” Flash memory) to ensure that you still see sub-millisecond latencies with millions of ops/sec throughput.

Q7. Redis was created by Salvatore Sanfilippo in 2009. What is his role today?

Ofer Bengal: Salvatore is leading the development of open source Redis within Redis Labs. He works with a group of experienced developers on extending the capabilities of Redis. A good example of this collaborative work is the recent introduction of Redis Modules, which extend Redis to a variety of new modern use cases. Salvatore wrote the API, and the other team members in a very short time created and tested a few modules, such as RediSearch (a full-text search engine) and Redis-ML (enhancing the performance of Spark machine learning capabilities). Salvatore's role is to continue the community innovation around the Redis core, together with his team of Redis Labs developers.

Q8. What are the differences between Redis Labs' version of Redis and the original one developed in 2009?

Yiftach Shoolman: Redis Labs fully supports the open source Redis versions, but enhances them with a container-like layer that adds a proxy, cluster management and a shared-nothing architecture. Taken together, Redis Labs provides a solid enterprise foundation for Redis, allowing it to scale seamlessly in memory across many hundreds of servers with high availability through persistence, in-memory cross-rack/zone/region/datacenter replication and instant automatic failover. No retooling or re-architecting is required to move from open source Redis to enterprise Redis; the process is basically effortless and immediate. Redis Labs also offers various database modules, like RediSearch, multiple probabilistic modules (Bloom filter, TopK, CMS), Redis-ML for machine learning, Redis-TS for time series processing, and JSON and Graph support.

Q9. What are the possible scenarios of using Redis for data analytics?

Ofer Bengal, Yiftach Shoolman: Redis data structures come with built-in simple analytic operations like counting, ranking, scoring, ranges and more. Over time, probabilistic data structures have added the ability to analytically estimate millions and trillions of events, without requiring memory to store all of the events.
Set operations have made it possible to simplify comparisons, intersections and unions of sets – analytics that are usually complicated with data stores. RQL (Redis SQL) and secondary indexing allow executing complex SQL queries on an existing Redis database. And finally, recent modules like RediSearch, Neural Redis and Redis-ML have added advanced search and machine learning capabilities – not naturally occurring in other databases.
With all of these possibilities, and with the move to automated decision making, we see increasing usage of Redis for data analytics scenarios.
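
A short sketch of the probabilistic counting and set operations described above, using redis-py; the keys and data are invented for illustration.

```python
# Sketch of the probabilistic counting and set analytics described above (keys invented).
import redis

r = redis.Redis(host="localhost", port=6379)

# HyperLogLog: estimate distinct visitors without storing every event
for user_id in ("u1", "u2", "u3", "u1"):
    r.pfadd("visitors:2017-02-13", user_id)
print(r.pfcount("visitors:2017-02-13"))        # ~3, using a few KB regardless of volume

# Set operations: intersections and unions without application-side joins
r.sadd("bought:product-a", "u1", "u2")
r.sadd("bought:product-b", "u2", "u3")
print(r.sinter("bought:product-a", "bought:product-b"))   # customers who bought both
print(r.sunion("bought:product-a", "bought:product-b"))   # customers who bought either
```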

Q10. How safe is a Redis server?

Yiftach Shoolman: The Redis enterprise server comes with client-based SSL authentication, built-in cloud firewall support (when running on public clouds), password authentication and role-based authorization that enables customizing security levels.

Qx. Anything else you wish to add?

Ofer Bengal: Redis is a game-changer when it comes to databases, and its progression over the last seven years has demonstrated that the industry and market are demanding performance and increasing flexibility to deal with all types of data processing, storage and analytic scenarios. Redis' core values have always included high performance, high throughput and very low latencies. With the visionary addition of modules, the community has turned it into an all-purpose datastore – suitable for any scenario that needs a database.

____________________________________

Ofer Bengal, Co-Founder and CEO of Redis Labs
Ofer is a serial entrepreneur who has founded and led several companies in the areas of data communications, telecommunications, Internet, homeland security and medical devices. Ofer was founder & CEO of RIT Technologies (NASDAQ: RITT), a provider of sophisticated telecommunications and data communications systems to major world carriers. He began his career as an aerospace engineer in the Israeli Air Force and then built his own aerospace engineering consulting firm. As a hobby, he has also invented, developed and licensed toy concepts to companies such as Milton Bradley, Hasbro and Tomy. Ofer holds a Bachelor of Science (cum laude) in aerospace engineering from the Technion, Israel Institute of Technology.

Yiftach Shoolman, Co-Founder and CTO of Redis Labs
Yiftach is an experienced technologist, having held leadership engineering and product roles in diverse fields from application acceleration, cloud computing and software-as-a-service (SaaS), to broadband networks and metro networks. He was the founder, president and CTO of Crescendo Networks (acquired by F5, NASDAQ:FFIV), the vice president of software development at Native Networks (acquired by Alcatel, NASDAQ: ALU) and part of the founding team at ECI Telecom broadband division, where he served as vice president of software engineering. Yiftach holds a Bachelor of Science in Mathematics and Computer Science and has completed studies for Master of Science in Computer Science at Tel-Aviv University.

Resources
Redis Cloud Now Available with Integrated Billing through AWS Marketplace- News Release- January 10, 2017.

AWS SaaS Marketplace.

Redis Documentation

EBOOK – REDIS IN ACTION This book covers the use of Redis, an in-memory database/data structure server.

Related Posts

New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker, ODBMS Industry Watch, November 30, 2016

Follow us on Twitter: @odbmsorg

##

Database Challenges and Innovations. Interview with Jim Starkey
http://www.odbms.org/blog/2016/08/database-challenges-and-innovations-interview-with-jim-starkey/
Wed, 31 Aug 2016 03:33:42 +0000

“Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size?” –Jim Starkey.

I have interviewed Jim Starkey. A database legend, Jim's career as an entrepreneur, architect, and innovator spans more than three decades of database history.

RVZ

Q1. In your opinion, what are the most significant advances in databases in the last few years?

Jim Starkey: I’d have to say the “atom programming model” where a database is layered on a substrate of peer-to-peer replicating distributed objects rather than disk files. The atom programming model enables scalability, redundancy, high availability, and distribution not available in traditional, disk-based database architectures.

Q2. What was your original motivation to invent the NuoDB Emergent Architecture?

Jim Starkey: It all grew out of a long Sunday morning shower. I knew that the performance limits of single-computer database systems were in sight, so distributing the load was the only possible solution, but existing distributed systems required that a new node copy a complete database or partition before it could do useful work. I started thinking of ways to attack this problem and came up with the idea of peer to peer replicating distributed objects that could be serialized for network delivery and persisted to disk. It was a pretty neat idea. I came out much later with the core architecture nearly complete and very wrinkled (we have an awesome domestic hot water system).

Q3. In your career as an entrepreneur and architect what was the most significant innovation you did?

Jim Starkey: Oh, clearly multi-generational concurrency control (MVCC). The problem I was trying to solve was allowing ad hoc access to a production database for a 4GL product I was working on at the time, but the ramifications go far beyond that. MVCC is the core technology that makes true distributed database systems possible. Transaction serialization is like Newtonian physics – all observers share a single universal reference frame. MVCC is like special relativity, where each observer views the universe from his or her reference frame. The views appear different but are, in fact, consistent.
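
A toy sketch can make the “each observer has its own reference frame” idea concrete: readers work against a snapshot and see only the versions committed at or before it. This is an illustration of the general MVCC idea only, not how InterBase, Falcon, or NuoDB actually implement it.

```python
# Toy illustration of multi-version concurrency control: each reader sees the
# latest version committed at or before its snapshot, never later changes.
class MVCCStore:
    def __init__(self):
        self.versions = {}        # key -> list of (commit_txn_id, value)
        self.txn_counter = 0

    def begin(self):
        return self.txn_counter   # snapshot: highest committed transaction id

    def write(self, key, value):
        self.txn_counter += 1     # commit a new version (single-writer toy model)
        self.versions.setdefault(key, []).append((self.txn_counter, value))

    def read(self, key, snapshot):
        visible = [v for txn, v in self.versions.get(key, []) if txn <= snapshot]
        return visible[-1] if visible else None

store = MVCCStore()
store.write("balance", 100)
reader = store.begin()            # snapshot taken here
store.write("balance", 250)       # later commit is invisible to this reader
print(store.read("balance", reader))          # -> 100
print(store.read("balance", store.begin()))   # -> 250
```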

Q4. Proprietary vs. open source software: what are the pros and cons?

Jim Starkey: It’s complicated. I’ve had feet in both camps for 15 years. But let’s draw a distinction between open source and open development. Open development – where anyone can contribute – is pretty good at delivering implementations of established technologies, but it’s very difficult to push the state of the art in that environment. Innovation, in my experience, requires focus, vision, and consistency that are hard to maintain in open development. If you have a controlled development environment, the question of open source versus propriety is tactics, not philosophy. Yes, there’s an argument that having the source available gives users guarantees they don’t get from proprietary software, but with something as complicated as a database, most users aren’t going to try to master the sources. But having source available lowers the perceived risk of new technologies, which is a big plus.

Q5. You led the Falcon project – a transactional storage engine for the MySQL server- through the acquisition of MySQL by Sun Microsystems. What impact did it have this project in the database space?

Jim Starkey: In all honesty, I’d have to say that Falcon’s most important contribution was its competition with InnoDB. In the end, that competition made InnoDB three times faster. Falcon, multi-version in memory using the disk for backfill, was interesting, but no matter how we cut it, it was limited by the performance of the machine it ran on. It was fast, but no single node database can be fast enough.

Q6. What are the most challenging issues in databases right now?

Jim Starkey: I think it’s time to step back and reexamine the assumptions that have accreted around database technology – data model, API, access language, data semantics, and implementation architectures. The “relational model”, for example, is based on what Codd called relations and we call tables, but otherwise have nothing to do with his mathematic model. That model, based on set theory, requires automatic duplicate elimination. To the best of my knowledge, nobody ever implemented Codd’s model, but we still have tables which bear a scary resemblance to decks of punch cards. Are they necessary? Or do they just get in the way?
Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size? SQL has no provision for flexible text search, no provision for multi-column, multi-table search, and no mechanics in the APIs to handle the results if it could do them. And this is just one of a dozen problems that SQL databases can’t handle. It was a really good technical fit for the computers, memory, and disks of the 1980s, but is it the right answer now?

Q7. How do you see the database market evolving?

Jim Starkey: I’m afraid my crystal ball isn’t that good. Blobs, another of my creations, spread throughout the industry in two years. MVCC took 25 years to become ubiquitous. I have a good idea of where I think it should go, but little expectation of how or when it will.

Qx. Anything else you wish to add?

Jim Starkey: Let me say a few things about my current project, AmorphousDB, an implementation of the Amorphous Data Model (meaning, no data model at all). AmorphousDB is my modest effort to question everything database.
The best way to think about Amorphous is to envision a relational database and mentally erase the boxes around the tables so all records free float in the same space – including data and metadata. Then, if you’re uncomfortable, add back a “record type” attribute and associated syntactic sugar, so table-type semantics are available, but optional. Then abandon punch card data semantics and view all data as abstract and subject to search. Eliminate the fourteen different types of numbers and strings, leaving simply numbers and strings, but add useful types like URL’s, email addresses, and money. Index everything unless told not to. Finally, imagine an API that fits on a single sheet of paper (OK, 9 point font, both sides) and an implementation that can span hundreds of nodes. That’s AmorphousDB.

————
Jim Starkey invented the NuoDB Emergent Architecture, and developed the initial implementation of the product. He founded NuoDB [formerly NimbusDB] in 2008, and retired at the end of 2012, shortly before the NuoDB product launch.

Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history, from the Datacomputer project on the fledgling ARPAnet to his most recent startup, NuoDB, Inc. Through that period, he has been responsible for many database innovations, from the date data type to the BLOB to multi-version concurrency control (MVCC). Starkey has extensive experience in proprietary and open source software.

Starkey joined Digital Equipment Corporation in 1975, where he created the Datatrieve family of products, the DEC Standard Relational Interface architecture, and the first of the Rdb products, Rdb/ELN. Starkey was also software architect for DEC’s database machine group.

Leaving DEC in 1984, Starkey founded Interbase Software to develop relational database software for the engineering workstation market. Interbase was a technical leader in the database industry producing the first commercial implementations of heterogeneous networking, blobs, triggers, two phase commit, database events, etc. Ashton-Tate acquired Interbase Software in 1991, and was, in turn, acquired by Borland International a few months later. The Interbase database engine was released open source by Borland in 2000 and became the basis for the Firebird open source database project.

In 2000, Starkey founded Netfrastructure, Inc., to build a unified platform for distributable, high quality Web applications. The Netfrastructure platform included a relational database engine, an integrated search engine, an integrated Java virtual machine, and a high performance page generator.

MySQL, AB, acquired Netfrastructure, Inc. in 2006 to be the kernel of a wholly owned transactional storage engine for the MySQL server, later known as Falcon. Starkey led the Falcon project through the acquisition of MySQL by Sun Microsystems.

Jim has a degree in Mathematics from the University of Wisconsin.
For amusement, Jim codes on weekends, while sailing, but not while flying his plane.

——————

Resources

NuoDB Emergent Architecture (.PDF)

On Database Resilience. Interview with Seth Proctor, ODBMS Industry Watch, March 17, 2015

Related Posts

– Challenges and Opportunities of The Internet of Things. Interview with Steve Cellini, ODBMS Industry Watch, October 7, 2015

– Hands-On with NuoDB and Docker, BY MJ Michaels, NuoDB. ODBMS.org– OCT 27 2015

– How leading Operational DBMSs rank popularity wise? By Michael Waclawiczek– ODBMS.org · JANUARY 27, 2016

– A Glimpse into U-SQL BY Stephen Dillon, Schneider Electric, ODBMS.org-DECEMBER 7, 2015

– Gartner Magic Quadrant for Operational DBMS 2015

Follow us on Twitter: @odbmsorg

##

On the Challenges and Opportunities of IoT. Interview with Steve Graves
http://www.odbms.org/blog/2016/07/on-the-challenges-and-opportunities-of-iot-interview-with-steve-graves/
Wed, 06 Jul 2016 09:00:29 +0000

“Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud.”–Steve Graves.

I have interviewed Steve Graves, co-founder and CEO of McObject. The main topic of the interview is the Internet of Things and how it relates to databases.

RVZ

Q1. What are in your opinion the main Challenges and Opportunities of the Internet of Things (IoT) seen from the perspective of a database vendor?

Steve Graves: Let’s start with the opportunities.

When we started McObject in 2001, we chose “eXtremeDB, the embedded database for intelligent, connected devices” as our tagline. eXtremeDB was designed from the get-go to live in the “things” comprising what the industry now calls the Internet of Things. The popularization of this term has created a lot of visibility and, more importantly, excitement and buzz for what was previously viewed as the relatively boring “embedded systems.” And that creates a lot of opportunities.

A lot of really smart, creative people are thinking of innovative ways to improve our health, our workplace, our environment, our infrastructure, and more. That means new opportunities for vendors of every component of the technology stack.
The challenges are manifold, and I can’t begin to address all of them. The media is largely fixated on security, which itself is multi-dimensional.
We can talk about protecting IoT-enabled devices (e.g. your car) from being hacked. We can talk about protecting the privacy of your data at rest. And we can talk about protecting the privacy of data in motion.
Every vendor needs to recognize the importance of security. But it isn't enough for a vendor, like McObject, to provide the features to secure the target system; the developer that assembles the stack along with their own proprietary technology to create an IoT solution needs to use the available security features, and use them correctly.

After security, scaling IoT systems is the next big challenge. It’s easy enough to prototype something.
But careful planning is needed to leap from prototype to full-blown deployment. Obvious decisions have to be made about connectivity and necessary bandwidth, how many things per gateway, one tier of gateways or more, and how much compute capacity is needed in the cloud. Beyond that, there are less obvious decisions to be made that will affect scalability, like making sure the DBMS used on devices and/or gateways is able to handle the workload (e.g. that the gateway DBMS can scale from 10 input streams to 100 input streams); determining how to divide the analytics workload between gateways and the cloud; and ensuring that the gateway, its DBMS and its communication stack can stream data to the cloud while simultaneously processing its own input streams and analytics.
Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud. In fact, ‘things’ are usually very much like good ol’ embedded systems, and system engineers that know their way around real-time/embedded operating systems, JTAG debuggers, and so on, have always been at a premium.

Q2. Data management for the IoT: What are the main differences between data management in field-deployed devices and at aggregation points?

Steve Graves: Quite simply: scale. A field-deployed device (or a gateway to field-deployed devices that do not, themselves, have any data management need or capability) has to manage a modest amount of data. But an aggregation point (the cloud being the most obvious example) has to manage many times more data – possibly orders of magnitude more.
At the same time, I have to say that they might not be all that different. Some IoT systems are going to be closed, meaning the nature of the things making up the system is known, and these won’t require much scaling. For example, a building automation system for a small- to mid-size building would have perhaps 100s of sensors and 10s of gateways, and may (or may not) push data up to a central aggregation point. If there are just 10s of gateways, we can create a UI that connects to the database on each gateway where each database is one shard of a single logical database, and execute analytics against that logical database without any need of a central aggregation point. We can extend this hypothetical case to a campus of buildings, or to a landlord with many buildings in a metropolitan area, and then a central aggregation point makes sense.

But the database system would not necessarily be different, only the organization of the physical and logical databases.
The gateways of each building would stream to a database server in the cloud. In the case of 10 buildings, we could have 10 database servers in the cloud that represent 10 shards of that logical database in the cloud. This architecture allows for great scalability. The landlord acquires another building? Great, stand up another database server and the UI connects to 11 shards instead of 10. In this scenario, database servers are software, not hardware. For the numbers we’re talking about (10 or 11 buildings), it could easily be handled by a single hardware server of modest ability.
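
A sketch of this fan-out idea: the UI (or an analytics job) queries each gateway or building shard and combines the results into one logical answer. The endpoints, query, and aggregation below are hypothetical placeholders, not McObject's API.

```python
# Sketch of fanning a query out to per-building shards and combining the results
# into one logical answer. Endpoints and the query itself are hypothetical.
import concurrent.futures

GATEWAY_SHARDS = [f"https://gateway-{i}.example.local/query" for i in range(10)]

def query_shard(endpoint, question):
    """Placeholder for a real shard query (e.g. an HTTP or native DB call)."""
    # In a real deployment this would contact the gateway's database.
    return {"endpoint": endpoint, "avg_temp_c": 21.0}

def query_logical_database(question):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda ep: query_shard(ep, question), GATEWAY_SHARDS))
    # Combine shard-level aggregates into one answer for the whole estate
    return sum(r["avg_temp_c"] for r in results) / len(results)

print(query_logical_database("average temperature over the last hour"))
```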

At the other end of the scale (pun intended) are IoT systems that are wide open. By that, I mean the creators are not able to anticipate the universe of “things” that could be connected, or their quantity. In the first case, the database system should be able to ingest data that was heretofore unknown. This argues for a NoSQL database system, i.e. a database system that is schema-less. In this scenario, the database system on field-deployed devices is probably radically different from the database system in the cloud. Field-deployed devices are purpose-specific, so A) they don’t need and wouldn’t benefit from a NoSQL database system, and B) most NoSQL database systems are too resource-hungry to reside on embedded device nodes.

Q3. If we look at the characteristics of a database system for managing device-based data in the IoT, how do they differ from the characteristics of a database system (typically deployed on a server) for analyzing the “big data” generated by myriad devices?

Steve Graves: Again, let’s recognize that field-deployed devices in the IoT are classic embedded systems. In practical terms, that means relatively modest hardware like an ARM, MIPS, PowerPC or Atom processor running at 100s of megahertz, or perhaps 1 ghz if we’re lucky, and with only enough memory to perform its function. Further, it may require a real-time operating system, or at least an embedded operating system that is less resource hungry than a full-on Linux distro. So, for a database system to run in this environment, it will need to have been designed to run in this environment. It isn’t practical to try to shoehorn in a database system that was written on the assumption that CPU cycles and memory are abundant. It may also be the case that the device has little-to-no persistent storage, which mandates an in-memory database.

So a database system for a field-deployed device is going to:
1. have a small code size
2. use little stack
3. preferably, allocate no heap memory
4. have no, or minimal, external dependencies (e.g. not link in an extra 1 MB of code from the C run-time library)
5. have a built-in ability to replicate data (to a gateway or directly to the cloud)
   a. replication should be “open”, meaning able to replicate to a different database system
6. have built-in security features

7. nice to have:
   a. built-in analytics to aggregate data prior to replicating it
   b. ability to define the schema
   c. ability to operate entirely in memory

A database system for the cloud might benefit from being schema-less, as described previously. It should certainly have pretty elastic scalability. Servers in the cloud are going to have ample resources and robust operating systems. So a database system for the cloud doesn’t need to have a small code size, use a small amount of stack memory, or worry about external dependencies such as the C run-time library. On the contrary, a database system for the cloud is expected to do much more (handle data at scale, execute analytics, etc.) and will, therefore, need ample resources. In fact, this database system should be able to take maximum advantage of the resources available, including being able to scale horizontally (across cores, CPUs, and servers).
In summary, the edge (device-based) DBMS needs to operate in a constrained environment. A cloud DBMS needs to be able to effectively and efficiently utilize the ample resources available to it.

Q4. Why is the ability to define a database schema important (versus a schema-less DBMS, aka NoSQL) for field-deployed devices?

Steve Graves: Field-deployed devices will normally perform a few specific functions (sometimes, just one function). For example, a building automation system manages HVAC, lighting, etc. A livestock management system manages feed, output, and so on. In such systems, the data requirements are well known. The hallmark NoSQL advantage of being able to store data without predefining its structure is unwarranted. The other purported hallmark of NoSQL is horizontal scalability, but this is not a need for field-deployed devices.
Walking away from the relational database model (and its implicit use of a database schema) has serious implications.
A great deal of scientific knowledge has been amassed around the relational database model over the last few decades, and without it developers are completely on their own with respect to enforcing sound data management practices.

In the NoSQL sphere, there is nothing comparable to the relational model (e.g. E.F. Codd’s work) and the mathematical foundation (relational calculus) underpinning it.
There should be overwhelming justification for a decision to not use relational.
In my experience, that justification is absent for data management of field-deployed devices.
A database system that “knows” the data design (via a schema) can more intelligently manage the data. For example, it can manage constraints, domain dependencies, events and much more. And some of the purported inflexibility imposed by a schema can be eliminated if the DBMS supports dynamic DDL (see more details on this in the answer to question Q6, below).

Q5. In your opinion, do IoT aggregation points resemble data lakes?

Steve Graves: The term data lake was originally conceived in the context of Hadoop and map-reduce functionality. In more recent times, the meaning of the term has morphed to become synonymous with big data, and that is how I use the term. Insofar as a gateway can also be an aggregation point, I would not say ‘aggregation points resemble data lakes’ because gateway aggregation points, in all likelihood, will not manage Big Data.

Q6. What are the main technical challenges for database systems used to accommodate new and unforeseen data, for example when a new type of device begins streaming data?

Steve Graves: The obvious challenges are
1. The ability to ingest new data that has a previously unknown structure
2. The ability to execute analytics on #1
3. The ability to integrate analytics on #1 with analytics on previously known data

#1 is handled well by NoSQL DBMSs. But, it might also be handled well by an RDBMS via “dynamic DDL” (dynamic data definition language), e.g. the ability to execute CREATE TABLE, ALTER TABLE, and/or CREATE INDEX statements against an existing database.
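
A minimal sketch of the dynamic DDL idea follows, using SQLite purely to keep the example self-contained; the device type, table and column names are invented, and a production IoT-scale RDBMS would obviously differ.

```python
# Sketch of "dynamic DDL": the schema is extended at run time when a new device
# type starts streaming. SQLite is used only to keep the example self-contained.
import sqlite3

db = sqlite3.connect(":memory:")

def ensure_table_for(device_type, fields):
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in fields)
    db.execute(f"CREATE TABLE IF NOT EXISTS {device_type} "
               f"(device_id TEXT, ts INTEGER, {cols})")
    db.execute(f"CREATE INDEX IF NOT EXISTS idx_{device_type}_ts ON {device_type}(ts)")

# A previously unseen device type appears on the network:
ensure_table_for("soil_moisture_sensor", [("moisture_pct", "REAL"), ("depth_cm", "REAL")])
db.execute("INSERT INTO soil_moisture_sensor VALUES (?, ?, ?, ?)",
           ("dev-17", 1467792000, 34.2, 10.0))
print(db.execute("SELECT * FROM soil_moisture_sensor").fetchall())
```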
To efficiently execute analytics against any data, the structure of the data must eventually be understood.
RDBMS handle this through the database dictionary (the binary equivalent of the data definition language).
But some NoSQL DBMSs handle this through different meta data. For example, the MarkLogic DBMS uses JSON metadata to understand the structure of documents in its document store.
NoSQL DBMSs with no meta data whatsoever put the entire burden on the developers. In other words, since the data is opaque to the DBMS, the application code must read and interpret the content.

Q7. Client/server DBMS architecture vs. in-process DBMSs: which one is more suitable for IoT?

Steve Graves: For edge DBMSs (on constrained devices), an in-process architecture will be more suitable. It requires fewer resources than a client/server architecture, and imposes less latency through the elimination of inter-process communication. For cloud DBMSs, a client/server architecture will be more suitable. In the cloud environment, resources are not scarce, and the advantage of being able to scale horizontally will outweigh the added latency associated with client/server.
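
To illustrate the architectural contrast (and not eXtremeDB's or any particular cloud DBMS's API), SQLite and Redis can stand in for an in-process and a client/server store respectively:

```python
# Contrast between an in-process and a client/server data store, illustrated with
# SQLite and Redis as generic stand-ins (not eXtremeDB or any specific cloud DBMS).
import sqlite3
import redis

# In-process: the engine runs inside the application; a query is a function call,
# with no inter-process communication, which suits constrained edge devices.
edge_db = sqlite3.connect(":memory:")
edge_db.execute("CREATE TABLE readings (ts INTEGER, value REAL)")
edge_db.execute("INSERT INTO readings VALUES (1, 21.5)")

# Client/server: the engine is a separate process (often on another machine);
# each request crosses a network hop, but the server can scale out horizontally.
cloud_db = redis.Redis(host="localhost", port=6379)
cloud_db.set("reading:1", 21.5)
```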

Qx Anything else you wish to add?

Steve Graves: We feel that eXtremeDB is uniquely positioned for the Internet of Things. Not only have devices and gateways been in eXtremeDB’s wheelhouse for 15 years with over 25 million real world deployments, but the scalability, time series data management, and analytics built into the eXtremeDB server (big data) offering make it an attractive cloud database solution as well. Being able to leverage a single DBMS across devices, gateways and the cloud has obvious synergistic advantages.

———————
Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.

Resources

Big Data, Analytics, and the Internet of Things, by Mohak Shah, analytics leader and research scientist at Bosch Research, USA.ODBMS.org APRIL 6, 2015

 Privacy considerations & responsibilities in the era of Big Data & Internet of Things, by Ramkumar Ravichandran, Director, Analytics, Visa Inc. ODBMS.org January 8, 2015.

 Securing Your Largest USB-Connected Device: Your Car,BY Shomit Ghose, General Partner, ONSET Ventures, ODBMs.org MARCH 31, 2016.

 eXtremeDB Financial Edition DBMS Sweeps Records in Big Data Benchmark,ODBMS.org JULY 2, 2016

 eXtremeDB in-memory database

 User Experience Design for the Internet of Things

Related Posts

On the Internet of Things. Interview with Colin Mahony, ODBMS Industry Watch, Published on 2016-03-14

A Grand Tour of Big Data. Interview with Alan Morrison, ODBMS Industry Watch, Published on 2016-02-25

On the Industrial Internet of Things. Interview with Leon Guzenda, ODBMS Industry Watch,  January 28, 2016

Follow us on Twitter: @odbmsorg

##

Using NoSQL for Ireland’s Online Tax Research Database.
http://www.odbms.org/blog/2016/05/using-nosql-for-irelands-online-tax-research-database/
Mon, 02 May 2016 08:18:17 +0000

“When the Institute began to look for a new platform, it became apparent that a relational database was not the best solution to effectively manage and deliver our XML content.”–Martin Lambe.

The Irish Tax Institute is the leading representative and educational body for Ireland’s AITI Chartered Tax Advisers (CTA) and is the only professional body exclusively dedicated to tax. One of its services is TaxFind – Ireland’s leading online tax research database, offering search across 200,000 pages of tax content, over 8,000 pages of Irish tax legislation, Irish Tax Institute tax technical papers, over 25 leading tax commentary publications, and thousands of Irish Tax Review articles.

I did a joint interview with Martin Lambe, CEO of the Irish Tax Institute, and Sam Herbert, Client Services Director at 67 Bricks.
The main topics of the interview are the data challenges they currently face and the implementation of TaxFind using MarkLogic.

RVZ

Q1. What are the main data challenges you currently have at the Irish Tax Institute?

Martin Lambe: The Irish Tax Institute moved its publication workflow to an XML-based process in 2009 and we have a large archive of valuable tax information contained in quite complex XML format. The main challenge was to find a solution that could store the repository of data (XML and other formats) and provide a simple search interface that directs users very quickly to the most relevant result. The “findability” of relevant content is crucial.

Q2. What is the TaxFind research database?

Martin Lambe: The Irish Tax Institute is the main provider of tax information in Ireland and TaxFind is the Institute’s online tax research database. TaxFind offers subscribers access to Irish tax legislation and guidance that includes tax technical papers from seminars and conferences, as well as over 30 tax commentary publications. It is used by thousands of CTAs in Ireland on a daily basis to assist in their tax research.

Q3. Who are the members that benefit from this TaxFind research database?

Martin Lambe: TaxFind serves the Chartered Tax Adviser (CTA) community in Ireland and other tax professionals such as those in the global accounting firms.

Q4. Why did you discard your previous implementation with a relational database system?

Martin Lambe: The previous database was literally creaking at the seams. Users were increasingly frustrated with difficulties accessing the database on different browsers and the old platform did not support mobile devices or tablets. When the Institute began to look for a new platform, it became apparent that a relational database was not the best solution to effectively manage and deliver our XML content. XML content stored in a NoSQL document database is indexed specifically for the search engine and this means the performance of our search engine and the relevancy of results is dramatically improved.

Q5. Why did you select MarkLogic`s NoSQL database platform?

Sam Herbert: MarkLogic is scalable to support fast querying across large amounts of data, it deals with XML content very well (and most of the tax data is either in XML, or in HTML that can be treated as XHTML), and has good searching. It is also a good environment to develop in – it has excellent documentation, and good tooling. It helps that it uses XQuery as one of its query languages, rather than a proprietary database-specific language.

Q6. Is SQL still important for you?

Sam Herbert: I don’t think it’s true to say that any particular type of technology is “important” to ITI – it’s all about how it can benefit users. From a 67 Bricks perspective, we work with relational databases, NoSQL databases, and graph databases depending on what shape the data is and what the needs are around querying it.

Q7 Why not choose an open source solution?

Sam Herbert: We’re using Open Source components in other parts of the system, and we’re keen on using Open Source where possible. However, for the data store, there aren’t any Open Source alternatives that have the combination of good scalability, good support for XML content, a standard query language, and powerful searching that we were looking for.

Q8. Can you tell us a bit about the architecture of the new implementation of the TaxFind research database

Sam Herbert: There are three major components:

– a frontend display and service layer written using the Play framework
– the MarkLogic data store
– a semantic enrichment component using Semaphore SmartLogic and the ITI taxonomy

The Play component is what users interact with – both for human users coming to the web site, and automated use of the web services. The bulk of the data retrieval and manipulation is done via a set of XQuery functions defined within the MarkLogic store. When new data is uploaded, it is processed within the Play code, enriched using Semaphore SmartLogic, and then stored in MarkLogic.

Q9. How do you manage to integrate Irish Tax Institute`s tax data, bringing together in excess of 300,000 pages of tax content including archive material in Word, PDF, XML and HTML?

Sam Herbert: The most complex part of the data is the XML content. These are very large XML files representing legislation, books, and other tax materials, that are inter-related in complex ways, and with a lot of deeply nested hierarchy. An important part of managing the data was splitting these into appropriately sized fragments, and then identifying the linking between different files – for example a piece of legislation will refer to other legislation, and commentary will refer to that legislation, and a new piece of legislation may supersede an earlier piece.

The non-XML content is larger in volume, but each individual document is smaller and is structurally simpler. Managing this content was largely a matter of loading it in and letting it be indexed.
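
A hedged sketch of the fragment-splitting step described above, using Python's standard XML library; the element names ("section", "ref") and the sample file name are hypothetical stand-ins for the real ITI schemas.

# Hypothetical element names ("section", "ref") and file name; the real schemas differ.
import xml.etree.ElementTree as ET

def split_into_fragments(path: str):
    """Yield (fragment_id, section_element) pairs, one per section, so each
    stored document stays an appropriately sized unit."""
    tree = ET.parse(path)
    act_id = tree.getroot().get("id", "unknown-act")
    for section in tree.iter("section"):
        frag_id = f"{act_id}/{section.get('id')}"
        yield frag_id, section

def extract_links(section) -> list:
    """Collect outbound cross-references (e.g. legislation cited by commentary)
    so related fragments can be linked to each other after loading."""
    return [ref.get("target") for ref in section.iter("ref") if ref.get("target")]

if __name__ == "__main__":
    for frag_id, section in split_into_fragments("taxes-consolidation-act.xml"):
        print(frag_id, "->", extract_links(section))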

Q10. How do you capture and digitize information in various formats and make it searchable?

Sam Herbert: Making it searchable is straightforward – it’s making it searchable in ways that support the expectations of the users that’s much more difficult.

A good search experience requires both subject matter expertise and good automated tests.

The basic search is using MarkLogic’s full text search. The next step was to work with tax experts within and outside the ITI to identify appropriate facets within the content with which to group the results – based on a combination of what the user requirements were and what was supported by the data.

There were additional complexities around weighting the search results to make the “best” results come at the top in as many circumstances as possible – for example, weighting terms within headings, weighting more recent content, weighting content based on its category so legislation is more important than commentary, and weighting content higher based on its popularity. The semantic enrichment based on tax terms from the ITI taxonomy also enhances the searching.
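
The following Python snippet is only an illustration of how such boosts can combine into a single relevance score; in the real system these weights live in MarkLogic's search configuration rather than in application code, and the numbers shown here are invented.

# Illustrative scoring only; the categories and multipliers are assumptions.
from datetime import date

CATEGORY_WEIGHT = {"legislation": 2.0, "commentary": 1.0}   # assumed categories

def weighted_score(text_score: float, in_heading: bool, published: date,
                   category: str, popularity: float) -> float:
    score = text_score
    if in_heading:
        score *= 1.5                        # boost matches found in headings
    age_years = (date.today() - published).days / 365.0
    score *= 1.0 / (1.0 + 0.1 * age_years)  # favour more recent content
    score *= CATEGORY_WEIGHT.get(category, 1.0)
    score *= 1.0 + min(popularity, 1.0)     # popularity capped so it cannot dominate
    return score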

Q11. How do you ensure that this solution is scalable?

Sam Herbert: The solution is deployed to a load-balanced cluster using Amazon Web Services. The Play frontend is purely stateless REST. This means that we can scale to support more users easily by spinning up more servers – and using AWS makes this easy. Overall, using AWS has been a big win for us, in terms of being able to get servers running easily, being able to increase and decrease things like their memory size easily, and the various ancillary services it provides like DNS and load balancing. We have also made sure we can scale to support additional data, so we can keep using MarkLogic effectively.

————-

Martin Lambe is Chief Executive of the Irish Tax Institute. His previous role within the Institute was that of Director of Finance.

Sam Herbert is Client Services Director at 67 Bricks, a company that works with information owners (particularly publishers) who want to enrich their content to make it more structured, granular, flexible and reusable.
67 Bricks utilises its deep understanding of the content enrichment challenge to help publishers develop systems and capabilities to increase the value of their content. With expertise in XML, business analysis, semantic tagging and software development, 67 Bricks works closely with its clients to develop and implement content enrichment capabilities and enriched content digital products.

————-
Resources

Irish Tax Institute

TaxFind

67 Bricks

MarkLogic

Related Posts

The rise of immutable data stores. By Alan Morrison, Senior Manager, PwC Center for technology and innovation (CTI). ODBMS.org

Unthink: Moving Beyond the Constraints of Relational Databases. By Tom McGrath, MarkLogic. ODBMS.org, March 14, 2016.

MarkLogic Case Study: Royal Society of Chemistry. ODBMS.org

On making information accessible. Interview with David Leeming. ODBMS Industry Watch, on July 30, 2014

Follow us on Twitter: @odbmsorg

##

On Big Data and Data Science. Interview with James Kobielus http://www.odbms.org/blog/2016/04/on-big-data-and-data-science-interview-with-james-kobielus/ http://www.odbms.org/blog/2016/04/on-big-data-and-data-science-interview-with-james-kobielus/#comments Tue, 19 Apr 2016 08:34:09 +0000 http://www.odbms.org/blog/?p=4119

“One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.”– James Kobielus

On the topics of Big Data, and Data Science, I have interviewed James Kobielus, IBM Big Data Evangelist.

RVZ

Q1. What kind of companies generate Big Data, besides the Internet giants?

James Kobielus: Big data isn’t something you “generate.” Rather, the term refers to the ability to achieve differentiated value from advanced analytics on trustworthy data at any scale. In other words, it’s a best practice, not a specific type of data or even a specific scale of data (measured in volume, velocity, and/or variety).

When considered in this light, you can identify big data analytic applications in every industry. Every C-level executive has strategic applications of big data. Here is just a smattering:

  • Chief Marketing Officers have been the prime movers on many big data initiatives that involve Hadoop, NoSQL, and other approaches. Their primary applications consist of marketing campaign optimization, customer churn and loyalty, upsell and cross-sell analysis, targeted offers, behavioral targeting, social media monitoring, sentiment analysis, brand monitoring, influencer analysis, customer experience optimization, content optimization, and placement optimization.
  • Chief Information Officers use big data platforms for data discovery, data integration, business analytics, advanced analytics, and exploratory data science.
  • Chief Operations Officers rely on big data for supply chain optimization, defect tracking, sensor monitoring, and smart grid, among other applications.
  • Chief Information Security Officers run security incident and event management, anti-fraud detection, and other sensitive applications on big data.
  • Chief Technology Officers do IT log analysis, event analytics, network analytics, and other systems monitoring, troubleshooting, and optimization applications on big data.
  • Chief Financial Officers run complex financial risk analysis and mitigation modeling exercises on big data platforms.

Q2. What are the most challenging problems you are facing when analysing Big Data?

James Kobielus: Searching for actionable intelligence in big data involves building and testing advanced-analytics models against large volumes of complex data that may be flowing in at high velocities.

At these scales, it’s easy to get overwhelmed in your analysis unless you automate the end-to-end processes of extracting intelligence at scale. Automation can also help control the cost of managing a growing volume of algorithmic models against ever expanding big-data collections. The key processes that need automating are data discovery, profiling, sampling, and preparation, as well as model building, scoring, and deployment.
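
As a generic, hedged illustration (not tied to any IBM product), a scikit-learn pipeline shows the idea of packaging data preparation, model building, and scoring into one repeatable, automatable unit. Because the whole pipeline is a single object, it can be re-fit and re-scored automatically as new data arrives.

# Generic illustration of automating preparation + model building + scoring.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Synthetic data keeps the example self-contained.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # automated preparation
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),   # automated model building
])

# Automated scoring: the same pipeline is re-evaluated as the data changes.
scores = cross_val_score(pipeline, X, y, cv=5)
print("mean accuracy:", scores.mean())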

Q3. How do you typically handle them?

James Kobielus: Automating the modeling process will boost data scientist productivity by an order of magnitude, freeing them from drudgery so that they can focus on the sorts of exploration, modeling, and visualization challenges that demand expert human judgment. Data scientists can accelerate their modeling automation initiatives by following these steps:

  • Virtualize access to data, metadata, rules, and predictive models, as well as to data integration, data warehousing, and advanced analytic applications through a BI semantic virtualization layer;
  • Unify access, governance, orchestration, automation, and administration across these resources within a service-oriented architecture;
  • Explore commercial tools that support maximum automation of model development, scoring, deployment, and execution;
  • Consolidate, accelerate, and deepen predictive analytics through integration into big-data platforms with scalable in-database execution; and
  • Migrate existing analytical data marts into multidomain big-data platforms with unified data, metadata, and model governance within a service-oriented virtualization framework.

Q4. What are in your experience the typical mistakes made in large scale data projects?

James Kobielus: One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.

Even if you accept that a data scientist’s integrity is rock-solid, intentions pure, skills stellar, and discipline rigorous, there’s no denying that bias may creep inadvertently into their work with big data. The biases may be minor or major, episodic or systematic, tangential or material to their findings and recommendations. Whatever their nature, the biases must be understood and corrected as fully as possible.

Here are some of the key sources of bias that may crop up in a data scientist’s work with big data:

  • Cognitive bias: This is the tendency to make skewed decisions based on pre-existing cognitive and heuristic factors–such as a misunderstanding of probabilities–rather than on the data and other hard evidence. You might say that the educated intuition that drives data science is rife with cognitive bias, but that’s not always a bad thing.
  • Selection bias: This is the tendency to skew your choice of data sources to those that may be most available, convenient, and cost-effective for your purposes, as opposed to being necessarily the most valid and relevant for your study. Clearly, data scientists do not have unlimited budgets, may operate under tight deadlines, and don’t use data for which they lack authorization. These constraints may introduce an unconscious bias in the big-data collections they are able to assemble.
  • Sampling bias: This is the tendency to skew the sampling of data sets toward subgroups of the population most relevant to the initial scope of a data-science project, thereby making it unlikely that you will uncover any meaningful correlations that may apply to other segments. Another source of sampling bias is “data dredging,” in which the data scientist uses regression techniques that may find correlations in samples but that may not be statistically significant in the wider population. Consequently, you’re likely to spuriously confirm your initial model for the segments that happen to make the sampling cut. (A short sketch after this list illustrates how data dredging produces spuriously “significant” correlations.)
  • Modeling bias: Beyond the biases just discussed, this is the tendency to skew data-science models by starting with a biased set of project assumptions that drive selection of the wrong variables, the wrong data, the wrong algorithms, and the wrong metrics of fitness. In addition, overfitting of models to past data without regard for predictive lift is a common bias. Likewise, failure to score and iterate models in a timely fashion with fresh observational data also introduces model decay, hence bias.
  • Funding bias: This may be the most silent but pernicious bias in data-scientific studies of all sorts. It’s the unconscious tendency to skew all modeling assumptions, interpretations, data, and applications to favor the interests of the party–employer, customer, sponsor, etc.–that employs or otherwise financially supports the data-science initiative. Funding bias makes it highly unlikely that data scientists will uncover disruptive insights that will “break the rice bowl” in which they make their living.
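
Here is the toy demonstration of data dredging referenced above: purely random features tested against a purely random outcome still yield a predictable crop of "significant" correlations that would not replicate.

# Toy demonstration of "data dredging": test enough random features against a
# random outcome and some will look "significant" purely by chance.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_samples, n_features = 200, 500
X = rng.normal(size=(n_samples, n_features))   # features with no real signal
y = rng.normal(size=n_samples)                 # outcome unrelated to any feature

false_hits = sum(
    1 for j in range(n_features) if pearsonr(X[:, j], y)[1] < 0.05
)
print(f"{false_hits} of {n_features} random features look 'significant' at p < 0.05")
# Expect roughly 5% (about 25) false hits: correlations that would not replicate.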

Q5. How do you measure “success” when analysing data?

James Kobielus: You measure success in your ability to distill useful insights in a timely fashion from the data at your disposal.

Q6. What skills are required to be an effective Data Scientist?

James Kobielus: Data science’s learning curve is formidable. To a great degree, you will need a degree, or something substantially like it, to prove you’re committed to this career. You will need to submit yourself to a structured curriculum to certify you’ve spent the time, money and midnight oil necessary for mastering this demanding discipline.

Sure, there are run-of-the-mill degrees in data-science-related fields, and then there are uppercase, boldface, bragging-rights “DEGREES.” To some extent, it matters whether you get that old data-science sheepskin from a traditional university vs. an online school vs. a vendor-sponsored learning program. And it matters whether you only logged a year in the classroom vs. sacrificed a considerable portion of your life reaching for the golden ring of a Ph.D. And it certainly matters whether you simply skimmed the surface of old-school data science vs. pursued a deep specialization in a leading-edge advanced analytic discipline.

But what matters most to modern business isn’t that every data scientist has a big honking doctorate. What matters most is that a substantial body of personnel has a common grounding in core curriculum of skills, tools and approaches. Ideally, you want to build a team where diverse specialists with a shared foundation can collaborate productively.

Big data initiatives thrive if all data scientists have been trained and certified on a curriculum with the following foundation:

  • Paradigms and practices: Every data scientist should acquire a grounding in core concepts of data science, analytics and data management. They should gain a common understanding of the data science lifecycle, as well as the typical roles and responsibilities of data scientists in every phase. They should be instructed on the various role(s) of data scientists and how they work in teams and in conjunction with business domain experts and stakeholders. And they learn a standard approach for establishing, managing and operationalizing data science projects in the business.
  • Algorithms and modeling: Every data scientist should obtain a core understanding of linear algebra, basic statistics, linear and logistic regression, data mining, predictive modeling, cluster analysis, association rules, market basket analysis, decision trees, time-series analysis, forecasting, machine learning, Bayesian and Monte Carlo statistics, matrix operations, sampling, text analytics, summarization, classification, principal components analysis, experimental design, unsupervised learning, and constrained optimization.
  • Tools and platforms: Every data scientist should master a core group of modeling, development and visualization tools used on your data science projects, as well as the platforms used for storage, execution, integration and governance of big data in your organization. Depending on your environment, and the extent to which data scientists work with both structured and unstructured data, this may involve some combination of data warehousing, Hadoop, stream computing, NoSQL and other platforms. It will probably also entail providing instruction in MapReduce, R and other new open-source development languages, in addition to SPSS, SAS and any other established tools.
  • Applications and outcomes: Every data scientist should learn the chief business applications of data science in your organization, as well as how to work best with subject-domain experts. In many companies, data science focuses on marketing, customer service, next best offer, and other customer-centric applications. Often, these applications require that data scientists understand how to leverage customer data acquired from structured survey tools, sentiment analysis software, social media monitoring tools and other sources. It is also essential that every data scientist gain an understanding of the key business outcomes–such as maximizing customer lifetime value–that should focus their modeling initiatives.

Classroom instruction is important, but a curriculum that is 100 percent devoted to reading books, taking tests and sitting through lectures is insufficient. Hands-on laboratory work is paramount for a truly well-rounded data scientist. Make sure that your data scientists acquire certifications and degrees that reflect real, hands-on experience developing statistical models that use real data and address substantive business issues.
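
As one hedged example of the kind of hands-on exercise such a curriculum might include, here is a short cluster-analysis lab in Python; the dataset is synthetic purely to keep the example self-contained, and the library choice is an assumption, not part of any specific curriculum.

# A small hands-on exercise: cluster analysis on a sample dataset.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((model.labels_ == k).sum()) for k in range(4)])
print("silhouette score:", round(silhouette_score(X, model.labels_), 3))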

A business-oriented data-science curriculum should produce expert developers of statistical and predictive models. It should not degenerate into a program that produces analytics geeks with heads stuffed with theory but whose diplomas are only fit for hanging on the wall.

Q7. Hadoop vs. Spark: what are the pros and cons?

James Kobielus: Big data analytics infrastructures are growing more hybridized than ever. Every new technology—such as Hadoop, in-memory databases, and graph databases—finds its specific niche in terms of use cases, deployment modes, and applications for which it is best suited.

Even as Apache Spark pushes more deeply into big-data environments, it won’t substantially change this trend. Yes, of course Spark is on the fast track to ubiquity in big-data analytics. This is especially true for the next generation of machine-learning applications that feed on growing in-memory pools and require low-latency distributed computations for streaming and graph analytics. But those use cases aren’t the sum total of big-data analytics and never will be.

As we all grow more infatuated with Spark, it’s important to continually remind ourselves of what it’s not suitable for. If, for example, one considers all the critical data management, integration, and preparation tasks that must be performed prior to modeling in Spark, it’s clear that these will not be executed in any of the Spark engines (Spark SQL, Spark Streaming, GraphX). Instead, they’ll be carried out in the data platforms and elastic clusters (HDFS, Cassandra, HBase, Mesos, cloud services, etc.) upon which those engines run. Likewise, you’d be hard-pressed to find anyone who’s seriously considering Spark in isolation for data warehousing, data governance, master data management, or operational business intelligence.

Above all else, Spark is the new power tool for data scientists who are pushing boundaries in the emerging era of in-memory big data analytics in low-latency scenarios of all types. Spark is proving its value as a development tool for the new generation of data scientists building the in-memory statistical models upon which it all will depend.
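
A minimal PySpark sketch of that usage pattern, assuming a working Spark installation; the HDFS path and column names are hypothetical, and data preparation is presumed to have already happened on the underlying platform.

# Minimal PySpark sketch; path and column names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("in-memory-modeling").getOrCreate()

# Preparation typically happens on the underlying platform (HDFS, Cassandra, etc.);
# here Spark just reads an already-prepared dataset.
df = spark.read.parquet("hdfs:///prepared/transactions.parquet")

features = VectorAssembler(
    inputCols=["amount", "account_age_days", "num_prior_orders"],
    outputCol="features",
)
train = features.transform(df).select("features", "label")

# Fit an in-memory statistical model with MLlib.
model = LogisticRegression(labelCol="label", featuresCol="features").fit(train)
print("coefficients:", model.coefficients)

spark.stop()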

Let’s not fall into the delusion that everything is converging toward Spark, as if it were the ravenous maw that will devour every other big-data analytics tool and platform. Spark is just another approach that’s being fitted to and optimized for specific purposes.

And let’s resist the hype that treats Spark as Hadoop’s “successor.” This implies that Hadoop and other big-data approaches are “legacy,” rather than what they are, which is foundational. For example, no one is seriously considering doing “data lakes,” “data reservoirs,” or “data refineries” on anything but Hadoop or NoSQL.

——————–

James Kobielus is an industry veteran and serves as IBM Big Data Evangelist; Senior Program Director for Product Marketing in Big Data Analytics; and Team Lead, Technical Marketing, IBM Big Data & Analytics Hub. He spearheads thought leadership activities across the IBM Analytics solution portfolio. He has spoken at such leading industry events as IBM Insight, Hadoop Summit, and Strata. He has published several business technology books and is a very popular provider of original commentary on blogs and many social media.

Resources

–  Master of Information and Data Science,  UC Berkeley School of Information.

– MS in Data Science, NYU Center for Data Science.

– Free data science curriculum, kdnuggets.com

Data Science | Coursera

– Master of Science in Data Science – Data Science Institute

Data Mining and Applications Graduate Certificate, Stanford

The European Data Science Academy (EDSA) designs curricula for data science training and data science education across the European Union (EU).

- The EDISON project focuses on establishing the new profession of ‘Data Scientist’, following the emergence of Data Science technologies (also referred to as Data Intensive or Big Data technologies), which change the way research is done, how scientists think and how research data are used and shared. This includes defining the required skills, a competence framework/profile, a corresponding Body of Knowledge and a model curriculum. The project will develop a sustainability/business model to ensure a sustained increase in Data Scientists graduating from universities and trained by other professional education and training institutions in Europe.
EDISON will facilitate the establishment of a Data Science education and training infrastructure at major European universities by promoting the experience of ‘champion’ universities, involving them in the coordinated development and implementation of the model curriculum and the creation of a cooperative educational and training infrastructure.

Related Posts

– RIP Big Data, By Carl Olofson, Research Vice President, Data Management Software Research, IDC. ODBMS.org, January  2016

Open Source Software and IBM’s Big Data platform. By Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory. ODBMS.org, April 2016.

Looking back at Big Data in 2015, By Cynthia M. Saracco, IBM Senior Solution Architect, ODBMS.org. November 2015

–  Heuristics for a Data Scientist: A common sense approach. BY Silvia Dassiè, Data Scientist at Ryanair. ODBMS.org, December 2015

The rise of immutable data stores. By Alan Morrison, Senior Manager, PwC Center for technology and innovation. ODBMS.org. October 2015

Follow us on Twitter: @odbmsorg

##

On the Internet of Things. Interview with Colin Mahony http://www.odbms.org/blog/2016/03/on-the-internet-of-things-interview-with-colin-mahony/ http://www.odbms.org/blog/2016/03/on-the-internet-of-things-interview-with-colin-mahony/#comments Mon, 14 Mar 2016 08:45:56 +0000 http://www.odbms.org/blog/?p=4101

“Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data.”– Colin Mahony

I have interviewed Colin Mahony, SVP & General Manager, HPE Big Data Platform. Topics of the interview are: The challenges of the Internet of Things, the opportunities for Data Analytics, the positioning of HPE Vertica and HPE Cloud Strategy.

RVZ

Q1. Gartner says 6.4 billion connected “things” will be in use in 2016, up 30 percent from 2015.  How do you see the global Internet of Things (IoT) market developing in the next years?

Colin Mahony: As manufacturers connect more of their “things,” they have an increased need for analytics to derive insight from massive volumes of sensor or machine data. I see these manufacturers, particularly manufacturers of commodity equipment, with a need to provide more value-added services based on their ability to provide higher levels of service and overall customer satisfaction. Data analytics platforms are key to making that happen. Also, we could see entirely new analytical applications emerge, driven by what consumers want to know about their devices and combine that data with, say, their exercise regimens, health vitals, social activities, and even driving behavior, for full personal insight.
Ultimately, the Internet of Things will drive a need for the Analyzer of Things, and that is our mission.

Q2. What challenges and opportunities does the Internet of Things (IoT) bring?

Colin Mahony: Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data. The reason? Traditional data warehouses were designed well before the Internet of Things, or, at least before OT (operational technology) like medical devices, industrial equipment, cars, and more were connected to the Internet. So, having an analytical platform to provide the scale and performance required to handle these volumes is important, but customers are taking more of a two- or three-tier approach that involves some sort of analytical processing at the edge before data is sent to an analytical data store. Apache Kafka is also becoming an important tier in this architecture, serving as a message bus, to collect and push that data from the edge in streams to the appropriate database, CRM system, or analytical platform for, as an example, correlation of fault data over months or even years to predict and prevent part failure and optimize inventory levels.
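
A hedged sketch of the edge-to-Kafka step, using the kafka-python client; the broker address, topic name and message format are assumptions for illustration, not a reference architecture.

# Hedged sketch with kafka-python; broker, topic and reading format are invented.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_reading(device_id: str, temperature_c: float) -> None:
    # An edge gateway might pre-aggregate or filter before publishing upstream.
    reading = {"device": device_id, "temp_c": temperature_c, "ts": time.time()}
    producer.send("sensor-readings", value=reading)

publish_reading("pump-017", 73.4)
producer.flush()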

Q3. Big Data: In your opinion, what are the current main demands/needs in the market?

Colin Mahony: All organizations want – and need – to become data-driven organizations. I mean, who wants to make such critical decisions based on half answers and anecdotal data? That said, traditional companies with data stores and systems going back 30-40 years don’t have the same level playing field as the next market disruptor that just received their series B funding and only knows that analytics is the life blood of their business and all their critical decisions.
The good news is that whether you are a 100-year old insurance company or the next Uber or Facebook, you can become a data-driven organization by taking an open platform approach that uses the best tool for the job and can incorporate emerging technologies like Kafka and Spark without having to bolt on or buy all of that technology from a single vendor and get locked in.  Understanding the difference between an open platform with a rich ecosystem and open source software as one very important part of that ecosystem has been a differentiator for our customers.

Beyond technology, we have customers that establish analytical centers of excellence that actually work with the data consumers – often business analysts – who run ad-hoc queries using their preferred data visualization tool to get the insight they need for their business unit or department. If the data analysts struggle, then this center of excellence, which happens to report up through IT, collaborates with them to understand and help them get to the analytical insight – rather than simply halting the queries with no guidance on how to improve.

Q4. How do you embed analytics and why is it useful? 

Colin Mahony: OEM software vendors, particularly, see the value of embedding analytics in their commercial software products or software as a service (SaaS) offerings.  They profit by creating analytic data management features or entirely new applications that put customers on a faster path to better, data-driven decision making. Offering such analytics capabilities enables them to not only keep a larger share of their customer’s budget, but at the same time greatly improve customer satisfaction. To offer such capabilities, many embedded software providers are attempting unorthodox fixes with row-oriented OLTP databases, document stores, and Hadoop variations that were never designed for heavy analytic workloads at the volume, velocity, and variety of today’s enterprise. Alternatively, some companies are attempting to build their own big data management systems. But such custom database solutions can take thousands of hours of research and development, require specialized support and training, and may not be as adaptable to continuous enhancement as a pure-play analytics platform. Both approaches are costly and often outside the core competency of businesses that are looking to bring solutions to market quickly.

Because it’s specifically designed for analytic workloads, HPE Vertica is quite different from other commercial alternatives. Vertica differs from OLTP DBMS and proprietary appliances (which typically embed row-store DBMSs) by grouping data together on disk by column rather than by row (that is, so that the next piece of data read off disk is the next attribute in a column, not the next attribute in a row). This enables Vertica to read only the columns referenced by the query, instead of scanning the whole table as row-oriented databases must do. This speeds up query processing dramatically by reducing disk I/O.
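
A conceptual toy in Python makes the I/O argument visible; it ignores compression, projections and everything else Vertica actually does, but shows why grouping data by column means a query touches only the columns it references.

# Conceptual toy only: illustrates column pruning, not Vertica internals.
row_store = [
    {"order_id": 1, "customer": "acme", "amount": 120.0, "region": "EU"},
    {"order_id": 2, "customer": "globex", "amount": 75.5, "region": "US"},
    {"order_id": 3, "customer": "initech", "amount": 99.9, "region": "US"},
]

# Column layout: each attribute is stored contiguously.
column_store = {
    "order_id": [1, 2, 3],
    "customer": ["acme", "globex", "initech"],
    "amount": [120.0, 75.5, 99.9],
    "region": ["EU", "US", "US"],
}

# SELECT sum(amount): the row layout touches every attribute of every row...
total_row = sum(row["amount"] for row in row_store)
# ...while the column layout touches only the single column the query references.
total_col = sum(column_store["amount"])
assert total_row == total_col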

You’ll find Vertica as the core analytical engine behind some popular products, including Lancope, Empirix, Good Data, and others as well as many HPE offerings like HPE Operations Analytics, HPE Application Defender, and HPE App Pulse Mobile, and more.

Q5. How do you make a decision when it is more appropriate to “consume and deploy” Big Data on premise, in the cloud, on demand and on Hadoop?

Colin Mahony: The best part is that you don’t need to choose with HPE. Unlike most emerging data warehouses as a service where your data is trapped in their databases when your priorities or IT policies change, HPE offers the most complete range of deployment and consumption models. If you want to spin up your analytical initiative on the cloud for a proof-of-concept or during the holiday shopping season for e-retailers, you can do that easily with HPE Vertica OnDemand.
If your organization finds that due to security or confidentiality or privacy concerns you need to bring your analytical initiative back in house, then you can use HPE Vertica Enterprise on-premises without losing any customizations or disruption to your business. Have petabyte volumes of largely unstructured data where the value is unknown? Use HPE Vertica for SQL on Hadoop, deployed natively on your Hadoop cluster, regardless of the distribution you have chosen. Each consumption model, available in the cloud, on-premise, on-demand, or using reference architectures for HPE servers, is available to you with that same trusted underlying core.

Q6. What is the new class of infrastructures called “composable”? Is it relevant for Big Data?

Colin Mahony: HPE believes that a new architecture is needed for Big Data – one that is designed to power innovation and value creation for the new breed of applications while running traditional workloads more efficiently.
We call this new architectural approach Composable Infrastructure. HPE has a well-established track record of infrastructure innovation and success. HPE Converged Infrastructure, software-defined management, and hyper-converged systems have consistently proven to reduce costs and increase operational efficiency by eliminating silos and freeing available compute, storage, and networking resources. Building on our converged infrastructure knowledge and experience, we have designed a new architecture that can meet the growing demands for a faster, more open, and continuous infrastructure.

Q7. What is HPE's cloud strategy?

Colin Mahony: Hybrid cloud adoption is continuing to grow at a rapid rate and a majority of our customers recognize that they simply can’t achieve the full measure of their business goals by consuming only one kind of cloud.
HPE Helion not only offers private cloud deployments and managed private cloud services, but we have created the HPE Helion Network, a global ecosystem of service providers, ISVs, and VARs dedicated to delivering open standards-based hybrid cloud services to enterprise customers. Through our ecosystem, our customers gain access to an expanded set of cloud services and improve their abilities to meet country-specific data regulations.

In addition to the private cloud offerings, we have a strategic and close alliance with Microsoft Azure, which enables many of our offerings, including Haven OnDemand, in the public cloud. We also work closely with Amazon because our strategy is not to limit our customers, but to ensure that they have the choices they need and the services and support they can depend upon.

Q8. What are the advantages of an offering like Vertica in this space?

Colin Mahony: More and more companies are exploring the possibility of moving their data analytics operations to the cloud. We offer HPE Vertica OnDemand, our data warehouse as a service, for organizations that need high-performance enterprise class data analytics for all of their data to make better business decisions now. Built by design to drastically improve query performance over traditional relational database systems, HPE Vertica OnDemand is engineered from the same technology that powers the HPE Vertica Analytics Platform. For organizations that want to select Amazon hardware and still maintain control over the installation, configuration, and overall maintenance of Vertica for ultimate performance and control, we offer Vertica AMI (Amazon Machine Image). The Vertica AMI is a bring-your-own-license model that is ideal for organizations that want the same experience as on-premise installations, only without procuring and setting up hardware. Regardless of which deployment model you choose, we have you covered with “on demand” or “enterprise cloud” options.

Q9. What is HPE Vertica Community Edition?

Colin Mahony: We have had tens of thousands of downloads of the HPE Vertica Community Edition, a freemium edition of HPE Vertica with all of the core features and functionality that you experience with our core enterprise offering. It’s completely free for up to 1 TB of data storage across three nodes. Companies of all sizes use the Community Edition to download, install, set up, and configure Vertica very quickly on x86 hardware, or use our Amazon Machine Image (AMI) for a bring-your-own-license approach to the cloud.

Q10. Can you tell us how Kiva.org, a non-profit organization, uses on-demand cloud analytics to leverage the internet and a worldwide network of microfinance institutions to help fight poverty? 

Colin Mahony: HPE is a major supporter of Kiva.org, a non-profit organization with a mission to connect people through lending to alleviate poverty. Kiva.org uses the internet and a worldwide network of microfinance institutions to enable individuals to lend as little as $25 to help create opportunity around the world. When the opportunity arose to help support Kiva.org with an analytical platform to further the cause, we jumped at the chance. Kiva.org relies on Vertica OnDemand to reduce capital costs, leverage the SaaS delivery model to adapt more quickly to changing business requirements, and work with over a million lenders, hundreds of field partners and volunteers, across the world. To see a recorded Webinar with HPE and Kiva.org, see here.

Qx Anything else you wish to add?

Colin Mahony: We appreciate the opportunity to share the features and benefits of HPE Vertica as well as the bright market outlook for data-driven organizations. However, I always recommend that any organization struggling with how to get started on its analytics initiative speak and meet with peers to learn best practices and avoid potential pitfalls. The best way to do that, in my opinion, is to visit with the more than 1,000 Big Data experts in Boston from August 29 – September 1st at the HPE Big Data Conference. Click here to learn more and join us for 40+ technical deep-dive sessions.

————-

Colin Mahony, SVP & General Manager, HPE Big Data Platform

Colin Mahony leads the Hewlett Packard Enterprise Big Data Platform business group, which is responsible for the industry leading Vertica Advanced Analytics portfolio, the IDOL Enterprise software that provides context and analysis of unstructured data, and Haven OnDemand, a platform for developers to leverage APIs and on demand services for their applications.
In 2011, Colin joined Hewlett Packard as part of the highly successful acquisition of Vertica, and took on the responsibility of VP and General Manager for HP Vertica, where he guided the business to remarkable annual growth and recognized industry leadership. Colin brings a unique combination of technical knowledge, market intelligence, customer relationships, and strategic partnerships to one of the fastest growing and most exciting segments of HP Software.

Prior to Vertica, Colin was a Vice President at Bessemer Venture Partners focused on investments primarily in enterprise software, telecommunications, and digital media. He established a great network and reputation for assisting in the creation and ongoing operations of companies through his knowledge of technology, markets and general management in both small startups and larger companies. Prior to Bessemer, Colin worked at Lazard Technology Partners in a similar investor capacity.

Prior to his venture capital experience, Colin was a Senior Analyst at the Yankee Group serving as an industry analyst and consultant covering databases, BI, middleware, application servers and ERP systems. Colin helped build the ERP and Internet Computing Strategies practice at Yankee in the late nineties.

Colin earned an M.B.A. from Harvard Business School and a bachelor’s degree in Economics with a minor in Computer Science from Georgetown University. He is an active volunteer with Big Brothers Big Sisters of Massachusetts Bay and the Joey Fund for Cystic Fibrosis.

Resources

What’s in store for Big Data analytics in 2016, Steve Sarsfield, Hewlett Packard Enterprise. ODBMS.org, 3 FEB, 2016

What’s New in Vertica 7.2?: Apache Kafka Integration!, HPE, last edited February 2, 2016

Gartner Says 6.4 Billion Connected “Things” Will Be in Use in 2016, Up 30 Percent From 2015, Press release, November 10, 2015

The Benefits of HP Vertica for SQL on Hadoop, HPE, July 13, 2015

Uplevel Big Data Analytics with Graph in Vertica – Part 5: Putting graph to work for your business , Walter Maguire, Chief Field Technologist, HP Big Data Group, ODBMS.org, 2 Nov, 2015

HP Distributed R, ODBMS.org, 19 FEB, 2015.

Understanding ROS and WOS: A Hybrid Data Storage Model, HPE, October 7, 2015

Related Posts

On Big Data Analytics. Interview with Shilpa Lawande. Source: ODBMS Industry Watch, published on December 10, 2015

On HP Distributed R. Interview with Walter Maguire and Indrajit Roy. Source: ODBMS Industry Watch, published on April 9, 2015

Follow us on Twitter: @odbmsorg

##

A Grand Tour of Big Data. Interview with Alan Morrison http://www.odbms.org/blog/2016/02/a-grand-tour-of-big-data-interview-with-alan-morrison/ http://www.odbms.org/blog/2016/02/a-grand-tour-of-big-data-interview-with-alan-morrison/#comments Thu, 25 Feb 2016 15:52:44 +0000 http://www.odbms.org/blog/?p=4087

“Leading enterprises have a firm grasp of the technology edge that’s relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.”–Alan Morrison.

I have interviewed Alan Morrison, senior research fellow at PwC, Center for Technology and Innovation.
Main topic of the interview is how the Big Data market is evolving.

RVZ

Q1. How do you see the Big Data market evolving? 

Alan Morrison: We should note first of all how true Big Data and analytics methods emerged and what has been disruptive. Over the course of a decade, web companies have donated IP and millions of lines of code that serve as the foundation for what’s being built on top. In the process, they’ve built an open source culture that is currently driving most big data-related innovation. As you mentioned to me last year, Roberto, a lot of database innovation was the result of people outside the world of databases changing what they thought needed to be fixed, people who really weren’t versed in the database technologies to begin with.

Enterprises and the database and analytics systems vendors who serve them have to constantly adjust to the innovation that’s being pushed into the open source big data analytics pipeline. Open source machine learning is becoming the icing on top of that layer cake.

Q2. In your opinion what are the challenges of using Big Data technologies in the enterprise?

Alan Morrison: Traditional enterprise developers were thrown for a loop back in the late 2000s when it came to open source software, and they’re still adjusting. The severity of the problem differs depending on the age of the enterprise. In our 2012 issue of the Forecast on DevOps, we made clear distinctions between three age classes of companies: legacy mainstream enterprises, pre-cloud enterprises and cloud natives. Legacy enterprises could have systems that are 50 years old or more still in place and have simply added to those. Pre-cloud enterprises are fighting with legacy that’s up to 20 years old. Cloud natives don’t have to fight legacy and can start from scratch with current tech.

DevOps (dev + ops) is an evolution of agile development that focuses on closer collaboration between developers and operations personnel. It enables multiple daily updates to operational codebases and feedback-response loop tuning: teams make small code changes and see how those changes affect user experience and behaviour. The linked article makes a distinction between legacy, pre-cloud and cloud native enterprises in terms of their inherent level of agility:

[Figure: relative agility of legacy mainstream, pre-cloud and cloud native enterprises]

Most enterprises are in the legacy mainstream group, and the technology adoption challenges they face are the same regardless of the technology. To build feedback-response loops for a data-driven enterprise in a legacy environment is more complicated in older enterprises. But you can create guerilla teams to kickstart the innovation process.

Q3. Is the Hadoop ecosystem now ready for enterprise deployment at large scale? 

Alan Morrison: Hadoop is ten years old at this point, and Yahoo, a very large mature enterprise, has been running Hadoop on 10,000 nodes for years now. Back in 2010, we profiled a legacy mainstream media company that was doing logfile analysis from all of its numerous web properties on a Hadoop cluster quite effectively. Hadoop is to the point where people in their dens and garages are putting it on Raspberry Pi systems. Lots of companies are storing data in or staging it from HDFS. HDFS is a given. MapReduce, on the other hand, has given way to Spark.

HDFS preserves files in their original format immutably, and that’s important. That innovation was crucial to data-driven application development a decade ago. But Hadoop isn’t the end state for distributed storage, and NoSQL databases aren’t either. It’s best to keep in mind that alternatives to Hadoop and its ecosystem are emerging.

I find it fascinating what folks like LinkedIn and Metamarkets are doing data architecture wise with the Kappa architecture–essentially a stream processing architecture that also works for batch analytics, a system where operational and analytical data are one and the same. That’s appropriate for fully online, all-digital businesses.  You can use HDFS, S3, GlusterFS or some other file system along with a database such as Druid. On the transactional side of things, the nascent IPFS (the Interplanetary File System) anticipates both peer-to-peer and the use of blockchains in environments that are more and more distributed. Here’s a diagram we published last year that describes this evolution to date:
[Figure: the evolution of distributed data architectures to date, from PwC Technology Forecast 2015]
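
A minimal sketch of the Kappa idea mentioned above, in Python: one append-only event log and one processing function serve both the streaming path (incremental updates) and the batch path (full replay), so operational and analytical data really are one and the same. The event shape and the materialized view are invented for illustration.

# Minimal Kappa-style sketch: one immutable log, one processing function,
# used both for continuous updates and for full batch recomputation (replay).
from collections import defaultdict

event_log = []                      # append-only log (a Kafka topic in practice)
view = defaultdict(float)           # materialized view: spend per customer

def process(event, state):
    state[event["customer"]] += event["amount"]

def append(event):
    event_log.append(event)         # the log is never overwritten
    process(event, view)            # streaming path: update the view incrementally

def rebuild():
    fresh = defaultdict(float)
    for event in event_log:         # batch path: replay the same log with the
        process(event, fresh)       # same code instead of a separate batch job
    return fresh

append({"customer": "acme", "amount": 120.0})
append({"customer": "acme", "amount": 30.0})
assert rebuild() == view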

People shouldn’t be focused on Hadoop itself, but on what comes next, now that Hadoop has cleared a path for it.

Q4. What are in your opinion the most innovative Big Data technologies?

Alan Morrison: The rise of immutable data stores (HDFS, Datomic, Couchbase and other comparable databases, as well as blockchains) was significant because it was an acknowledgement that data history and permanence matters, the technology is mature enough and the cost is low enough to eliminate the need to overwrite. These data stores also established that eliminating overwrites also eliminates a cause of contention. We’re moving toward native cloud and eventually the P2P fog (localized, more truly distributed computing) that will extend the footprint of the cloud for the Internet of things.

Unsupervised machine learning has made significant strides in the past year or two, and it has become possible to extract facts from unstructured data, building on the success of entity and relationship extraction. What this advance implies is the ability to put humans in feedback loops with machines, where they let machines discover the data models and facts and then tune or verify those data models and facts.

In other words, large enterprises now have the capability to build their own industry- and organization-specific knowledge graphs and begin to develop cognitive or intelligent apps on top of those knowledge graphs, along the lines of what Cirrus Shakeri of Inventurist envisions.

[Figure: from big data to intelligent applications. From Cirrus Shakeri, “From Big Data to Intelligent Applications,” blog post, January 2015]

At the core of computable semantic graphs (Shakeri’s term for knowledge graphs or computable knowledge bases) is logically consistent semantic metadata. A machine-assisted process can help with entity and relationship extraction and then also ontology generation.

Computability = machine readability. Semantic metadata–the kind of metadata cognitive computing apps use–can be generated with the help of a well-designed and updated ontology. More and more, these ontologies are uncovered in text rather than hand built, but again, there’s no substitute for humans in the loop. Think of the process of cognitive app development as a continual feedback-response loop process. The use of agents can facilitate the construction of these feedback loops.
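
A toy Python sketch of that loop, with networkx standing in for the knowledge-graph store: machine-proposed triples enter unverified, and a human either confirms or rejects them, feedback that would in practice also tune the extraction models. The entities, relations and the "extraction" step itself are invented for illustration.

# Toy sketch: real pipelines use trained entity/relationship extractors; the
# "extracted" triples are hard-coded so the human-in-the-loop step is visible.
import networkx as nx

extracted_triples = [
    ("Acme Corp", "acquired", "Initech"),            # machine-proposed facts
    ("Initech", "headquartered_in", "Austin"),
]

graph = nx.DiGraph()
for subj, relation, obj in extracted_triples:
    # Facts enter the graph unverified; a domain expert later confirms or rejects
    # them, and those decisions feed back to improve the extraction models.
    graph.add_edge(subj, obj, relation=relation, verified=False)

def verify(subj, obj, accepted: bool):
    if accepted:
        graph[subj][obj]["verified"] = True
    else:
        graph.remove_edge(subj, obj)

verify("Acme Corp", "Initech", accepted=True)
print(nx.get_edge_attributes(graph, "verified"))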

Q5. In a recent note Carl Olofson, Research Vice President, Data Management Software Research, IDC, predicted the RIP of “Big Data” as a concept. What is your view on this?

Alan Morrison: I agree the term is nebulous and can be misleading, and we’ve had our fill of it. But that doesn’t mean it won’t continue to be used. Here’s how we defined it back in 2009:

Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used. (See https://www.pwc.com/us/en/technology-forecast/assets/pwc-tech-forecast-issue3-2010.pdf, pg. 6.)

For that issue of the Forecast, we focused on how Hadoop was being piloted in enterprises and the ecosystem that was developing around it. Hadoop was the primary disruptive technology, as well as NoSQL databases. It helps to consider the data challenge of the 2000s and how relational databases and enterprise data warehousing techniques were falling short at that point.  Hadoop has reduced the cost of analyzing data by an order of magnitude and allows processing of very large unstructured datasets. NoSQL has made it possible to move away from rigid data models and standard ETL.

“Big Data” can continue to be shorthand for petabytes of unruly, less structured data. But why not talk about the system instead of just the data? I like the term that George Gilbert of Wikibon latched on to last year. I don’t know if he originated it, but he refers to the System of Intelligence. That term gets us beyond the legacy, pre-web “business intelligence” term, more into actionable knowledge outputs that go beyond traditional reporting and into the realm of big data, machine learning and more distributed systems. The Hadoop ecosystem, other distributed file systems, NoSQL databases and the new analytics capabilities that rely on them are really at the heart of a System of Intelligence.

Q6. How many enterprise IT systems do you think we will need to interoperate in the future? 

Alan Morrison: I like Geoffrey Moore‘s observations about a System of Engagement that emerged after the System of Record, and just last year George Gilbert was adding to that taxonomy with a System of Intelligence. But you could add further to that with a System of Collection that we still need to build. Just to be consistent, the System of Collection articulates how the Internet of Things at scale would function on the input side. The System of Engagement would allow distribution of the outputs. For the outputs of the System of Collection to be useful, that system will need to interoperate in various ways with the other systems.

To summarize, there will actually be four enterprise IT systems that will need to interoperate, ultimately. Three of these exist, and one still needs to be created.

The fuller picture will only emerge when this interoperation becomes possible.

Q7. What are the  requirements, heritage and legacy of such systems?

Alan Morrison: The System of Record (RDBMSes) still relies on databases and tech with their roots in the pre-web era. I’m not saying these systems haven’t been substantially evolved and refined, but they do still reflect a centralized, pre-web mentality. Bitcoin and Blockchain make it clear that the future of Systems of Record won’t always be centralized. In fact, microtransaction flows in the Internet of Things at scale will depend on the decentralized approaches,  algorithmic transaction validation, and immutable audit trail creation which blockchain inspires.

The Web is only an interim step in the distributed system evolution. P2P systems will eventually complement the web, but they’ll take a long time to kick in fully–well into the next decade. There’s always the S-curve of adoption that starts flat for years. P2P has ten years of an installed base of cloud tech, twenty years of web tech and fifty years plus of centralized computing to fight with. The bitcoin blockchain seems to have kicked P2P in gear finally, but progress will be slow through 2020.

The System of Engagement (requiring Web DBs) primarily relies on Web technology (MySQL and NoSQL) in conjunction with traditional CRM and other customer-related structured databases.

The System of Intelligence (requiring Web file systems and less structured DBs) primarily relies on NoSQL, Hadoop, the Hadoop ecosystem and its successors, but is built around a core DW/DM RDBMS analytics environment with ETLed structured data from the System of Record and System of Engagement. The System of Intelligence will have to scale and evolve to accommodate input from the System of Collection.

The System of Collection (requiring distributed file systems and DBs) will rely on distributed file system successors to Hadoop and HTTP such as IPFS and the more distributed successors to MySQL+ NoSQL. Over the very long term, a peer-to-peer architecture will emerge that will become necessary to extend the footprint of the internet of things and allow it to scale.

Q8. Do you already have the piece parts to begin to build out a 2020+ intersystem vision now?

Alan Morrison: Contextual, ubiquitous computing is the vision of the 2020s, but to get to that, we need an intersystem approach. Without interoperation of the four systems I’ve alluded to, enterprises won’t be able to deliver the context required for competitive advantage. Without sufficient entity and relationship disambiguation via machine learning in machine/human feedback loops, enterprises won’t be able to deliver the relevance for competitive advantage.

We do have the piece parts to begin to build out an intersystem vision now. For example, interoperation is a primary stumbling block that can be overcome now. Middleware has been overly complex and inadequate to the current-day task, but middleware platforms such as EnterpriseWeb are emerging that can reach out as an integration fabric for all systems, up and down the stack. Here’s how the integration fabric becomes an essential enabler for the intersystem approach:

[Figure: the integration fabric as an enabler of the intersystem approach. PwC, 2015]

A lot of what EnterpriseWeb (full disclosure: a JBR partner of PwC) does hinges on the creation and use of agents and semantic metadata that enable the data/logic virtualization. That’s what makes the desiloing possible. One of the things about the EnterpriseWeb platform is that it’s a full stack virtual integration and application platform, using methods that have data layer granularity, but process layer impact. Enterprise architects can tune their models and update operational processes at the same time. The result: every change is model-driven and near real-time. Stacks can all be simplified down to uniform, virtualized composable entities using enabling technologies that work at the data layer. Here’s how they work:

[Figure: virtualized composable entities working at the data layer. PwC, 2015]

So basically you can do process refinement across these systems, and intersystem analytics views thus also become possible.

Qx anything else you wish to add? 

Alan Morrison: We always quote science fiction writer William Gibson, who said,

“The future is already here — it’s just not very evenly distributed.”

Enterprises would do best to remind themselves what’s possible now and start working with it. You’ve got to grab onto that technology edge and let it pull you forward. If you don’t understand what’s possible, most relevant to your future business success and how to use it, you’ll never make progress and you’ll always be reacting to crises. Leading enterprises have a firm grasp of the technology edge that’s relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.

We do a ton of research to get to the big picture and find the real edge, where tech could actually have a major business impact. And we try to think about what the business impact will be, rather than just thinking about the tech. Most folks who are down in the trenches are dismissive of the big picture, but the fact is they aren’t seeing enough of the horizon to make an informed judgement. They are trying to use tools they’re familiar with to address problems the tools weren’t designed for. Alongside them should be some informed contrarians and innovators to provide balance and get to a happy medium.

That’s how you counter groupthink in an enterprise. Executives need to clear a path for innovation and foster a healthy, forward-looking, positive and tolerant mentality. If the workforce is cynical, that’s an indication that they lack a sense of purpose or are facing systemic or organizational problems they can’t overcome on their own.

—————–
Alan Morrison (@AlanMorrison) is a senior research fellow at PwC, a longtime technology trends analyst and an issue editor of the firm’s Technology Forecast

Resources

Data-driven payments. How financial institutions can win in a networked economy. By Mark Flamme, Partner; Kevin Grieve, Partner; Mike Horvath, Principal, Strategy&. February 4, 2016, ODBMS.org

The rise of immutable data stores, By Alan Morrison, Senior Manager, PwC Center for technology and innovation (CTI), OCTOBER 9, 2015, ODBMS.org

The enterprise data lake: Better integration and deeper analytics, By Brian Stein and Alan Morrison, PwC, AUGUST 20, 2014 ODBMS.org

Related Posts

On the Industrial Internet of Things. Interview with Leon Guzenda , ODBMS Industry Watch, January 28, 2016

On Big Data and Society. Interview with Viktor Mayer-Schönberger , ODBMS Industry Watch, January 8, 2016

On Big Data Analytics. Interview with Shilpa Lawande , ODBMS Industry Watch, December 10, 2015

On Dark Data. Interview with Gideon Goldin , ODBMS Industry Watch, November 16, 2015

Follow us on Twitter: @odbmsorg

##

Data for the Common Good. Interview with Andrea Powell http://www.odbms.org/blog/2015/06/data-for-the-common-good-interview-with-andrea-powell/ http://www.odbms.org/blog/2015/06/data-for-the-common-good-interview-with-andrea-powell/#comments Tue, 09 Jun 2015 10:55:08 +0000 http://www.odbms.org/blog/?p=3933

“CABI has a proud history (we were founded in 1910) of serving the needs of agricultural researchers around the world, and it is fascinating to see how technology can now help to achieve our development mission. We can have much greater impact at scale these days on the lives of poor farmers around the world (on whom we are all dependent for our food) by using modern technology and by putting knowledge into the hands of those who need it the most.”–Andrea Powell

I have interviewed Andrea Powell,Chief Information Officer at CABI.
Main topic of the interview is how to use data and knowledge for the Common Good, specifically by solving problems in agriculture and the environment.

RVZ

Q1. What is the main mission of CABI?

Andrea Powell: CABI’s mission is to improve people’s lives and livelihoods by solving problems in agriculture and the environment.
CABI is a not-for-profit, intergovernmental organisation with over 500 staff based in 17 offices around the world. We focus primarily on plant health issues, helping smallholder farmers to lose less of what they grow and therefore to increase their yields and their incomes.

Q2. How effective is scientific publishing in helping the developing world solving agricultural problems?

Andrea Powell: Our role is to bridge the gap between research and practice.
Traditional scientific journals serve a number of purposes in the scholarly communication landscape, but they are often inaccessible or inappropriate for solving the problems of farmers in the developing world. While there are many excellent initiatives which provide free or very low-cost access to the research literature in these countries, what is often more effective is working with local partners to develop and implement local solutions which draw on and build upon that body of research.
Publishers have pioneered innovative uses of technology, such as mobile phones, to ensure that the right information is delivered to the right person in the right format.
This can only be done if the underlying information is properly categorised, indexed and stored, something that publishers have done for many decades, if not centuries. Increasingly we are able to extract extra value from original research content by text and data mining and by adding extra semantic concepts so that we can solve specific problems.

Q3. What are the typical real-world problems that you are trying to solve? Could you give us some examples of your donor-funded development programs?

Andrea Powell: In our Plantwise programme, we are working hard to reduce the crop losses that happen due to the effects of plant pests and diseases. Farmers can typically lose up to 40% of their crop in this way, so achieving just a 1% reduction in such losses could feed 25 million more hungry mouths around the world. Another initiative, called mNutrition, aims to deliver practical advice to farming families in the developing world about how to grow more nutritionally valuable crops, and is aimed at reducing child malnutrition and stunting.

Q4. How do you measure your impact and success?

Andrea Powell: We have a strong focus on Monitoring and Evaluation, and for each of our projects we include a “Theory of Change” which allows us to measure and monitor the impact of the work we are doing. In some cases, our donors carry out their own assessments of our projects and require us to demonstrate value for money in measurable ways.

Q5. What are the main challenges you are currently facing for ensuring CABI’s products and services are fit for purpose in the digital age?

Andrea Powell: The challenges vary considerably depending on the type of customer or beneficiary.
In our developed world markets, we already generate some 90% of our income from digital products, so the challenge there is keeping our products and platforms up-to-date and in tune with the way modern researchers and practitioners interact with digital content. In the developing world, the focus is much more on the use of mobile phone technology, so transforming our content into a format that makes it easy and cheap to deliver via this medium is a key challenge. Often this can take the form of a simple text message which needs to be translated into multiple languages and made highly relevant for the recipient.

Q6. You have one of the world’s largest agricultural database that sits in a RDBMS, and you also have info silos around the company. How do you pull all of these information together?

Andrea Powell: At the moment, with some difficulty! We do use APIs to enable us to consume content from a variety of sources in a single product and to render that content to our customers using a highly flexible Web Content Management System. However, we are in the process of transforming our current technology stack and replacing some of our Relational Databases with MarkLogic, to give us more flexibility and scalability. We are very excited about the potential this new approach offers.

Q7. How do you represent and model all of this knowledge? Could you give us an idea of how the data management part for your company is designed and implemented?

Andrea Powell: We have a highly structured taxonomy that enables us to classify and categorise all of our information in a consistent and meaningful way, and we have recently implemented a semantic enrichment toolkit, TEMIS Luxid® to make this process even more efficient and automated. We are also planning to build a Knowledge Graph based on linked open data, which will allow us to define our domain even more richly and link our information assets (and those of other content producers) by defining the relationships between different concepts.
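
To make the linked-data idea concrete, here is a minimal sketch (not CABI’s actual vocabulary or tooling) of how such relationships can be expressed as triples; the namespace, URIs and example facts are invented purely for illustration.

```python
# Minimal linked-data sketch; the vocabulary below is hypothetical,
# not CABI's actual ontology.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/agri/")  # invented namespace
g = Graph()
g.add((EX.fall_armyworm, RDF.type, EX.Pest))
g.add((EX.fall_armyworm, EX.attacks, EX.maize))
g.add((EX.maize, EX.grownIn, EX.kenya))

# Ask the graph: which crops does each pest attack?
for pest, _, crop in g.triples((None, EX.attacks, None)):
    print(pest, "attacks", crop)
```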

Q8. What kind of predictive analytics do you use or plan to use?

Andrea Powell: We are very excited by the prospect of being able to do predictive analysis on the spread of particular crop diseases or on the impact of invasive species. We have had some early investigations into how we can use semantics to achieve this; e.g. if pest A attacks crop B in country C, what is the likelihood of it attacking crop D in country E which has the same climate and soil types as country C?
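
As a hedged illustration of that reasoning pattern (the pests, crops, countries and climate profiles below are invented, and this is a toy rule rather than CABI’s actual analytics), the inference might be sketched like this:

```python
# Toy version of the rule: if a pest attacks a crop in one country,
# flag crops grown in other countries with the same climate and soil.
attacks = {("pest_A", "crop_B", "country_C")}                 # observed facts (hypothetical)
grows = {("country_C", "crop_B"), ("country_E", "crop_D")}    # where crops are grown (hypothetical)
profile = {"country_C": "tropical_loam", "country_E": "tropical_loam"}  # climate + soil type

def at_risk(pest):
    """Return (country, crop) pairs sharing a climate/soil profile with an affected country."""
    risks = set()
    for p, _, affected_country in attacks:
        if p != pest:
            continue
        for country, crop in grows:
            if country != affected_country and profile.get(country) == profile.get(affected_country):
                risks.add((country, crop))
    return risks

print(at_risk("pest_A"))   # -> {('country_E', 'crop_D')}
```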

Q9. How do you intend to implement such predictive analytics?

Andrea Powell: We plan to deploy a combination of expert subject knowledge, data mining techniques and clever programming!

Q10. What are future strategic developments?

Andrea Powell: Increasingly we are developing knowledge-based solutions that focus on solving specific problems and on fitting into user workflows, rather than creating large databases of content with no added analysis or insight. Mobile will become the primary delivery channel and we will also be seeking to use mobile technology to gather user data for further analysis and product development.

Qx Anything else you wish to add?

Andrea Powell: CABI has a proud history (we were founded in 1910) of serving the needs of agricultural researchers around the world, and it is fascinating to see how technology can now help to achieve our development mission. We can have much greater impact at scale these days on the lives of poor farmers around the world (on whom we are all dependent for our food) by using modern technology and by putting knowledge into the hands of those who need it the most.

————–
ANDREA POWELL, Chief Information Officer, CABI, United Kingdom.
I am a linguist by training (French and Russian) with an MA from Cambridge University but have worked in the information industry since graduating in 1988. After two and a half years with Reuters I joined CABI in the Marketing Department in 1991 and have worked here ever since. Since January 2015 I have held the position of Chief Information Officer, leading an integrated team of content specialists and technologists to ensure that all CABI’s digital and print publications are produced on time and to the quality standards expected by our customers worldwide. I am responsible for future strategic development, for overseeing the development of our technical infrastructure and data architecture, and for ensuring that appropriate information & communication technologies are implemented in support of CABI’s agricultural development programmes around the world.

Resources

– More information about how CABI is using MarkLogic can be found in this video, recorded at MarkLogic World San Francisco, April 2015.

Related Posts

Big Data for Good. ODBMS Industry Watch June 4, 2012. A distinguished panel of experts discuss how Big Data can be used to create Social Capital.

Follow ODBMS.org on Twitter: @odbmsorg

##

Big Data and the financial services industry. Interview with Simon Garland http://www.odbms.org/blog/2015/06/big-data-and-the-financial-services-industry-interview-with-simon-garland/ http://www.odbms.org/blog/2015/06/big-data-and-the-financial-services-industry-interview-with-simon-garland/#comments Tue, 02 Jun 2015 07:56:43 +0000 http://www.odbms.org/blog/?p=3911

“The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.”–Simon Garland

The topic of my interview with Simon Garland, Chief Strategist at Kx Systems, is Big Data and the financial services industry.

RVZ

Q1. Talking about the financial services industry, what types of data and what quantities are common?

Simon Garland: The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.

The data comes in through feed-handlers as streaming data. It is stored in-memory throughout the day and is appended to the on-disk historical database at the day’s end. Algorithmic trading decisions are made on a millisecond basis using this data. The associated risks are evaluated in real-time based on analytics that draw on intraday data that resides in-memory and historical data that resides on disk.
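
As a rough illustration of that intraday/end-of-day pattern (this is not kdb+ code; the tick fields, callback and file layout are hypothetical), the flow can be sketched as:

```python
# Sketch: ticks stream into an in-memory table during the day and are
# appended to an on-disk historical store at end of day. Illustrative only.
import csv
import datetime
import pathlib

intraday = []  # in-memory table of today's ticks

def on_tick(sym, price, size, ts):
    """Feed-handler callback: keep the tick in memory for intraday queries."""
    intraday.append({"sym": sym, "price": price, "size": size, "ts": ts})

def end_of_day_flush(hist_dir="hist"):
    """Append the day's in-memory ticks to the on-disk historical store."""
    day = datetime.date.today().isoformat()
    path = pathlib.Path(hist_dir) / f"{day}.csv"
    path.parent.mkdir(exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["sym", "price", "size", "ts"])
        writer.writeheader()
        writer.writerows(intraday)
    intraday.clear()
```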

Q2. What are the most difficult data management requirements for high performance financial trading and risk management applications?

Simon Garland: There has been a decade-long arms race on Wall Street to achieve trading speeds that get faster every year. Global financial institutions in particular have spent heavily on high performance software products, as well as IT personnel and infrastructure just to stay competitive. Traders require accuracy, stability and security at the same time that they want to run lightning fast algorithms that draw on terabytes of historical data.

Traditional databases cannot perform at these levels. Column store databases are generally recognized to be orders of magnitude faster than a regular RDBMS, and a time-series-optimized columnar database is uniquely suited for delivering the performance and flexibility required by Wall Street.
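
A small illustration of why a columnar layout helps: aggregating one field across many records touches a single contiguous vector rather than every record. The data, field names and sizes below are made up.

```python
# Compute a volume-weighted average price two ways: over row-oriented
# records versus over contiguous per-field vectors (the column-store view).
import numpy as np

n = 100_000
# Row-oriented: one record per trade; reading one field still loads the record.
rows = [{"sym": "XYZ", "price": 100.0 + i % 5, "size": 100} for i in range(n)]
row_vwap = sum(r["price"] * r["size"] for r in rows) / sum(r["size"] for r in rows)

# Column-oriented: each field is its own contiguous vector, scanned in one pass.
price = np.array([r["price"] for r in rows])
size = np.array([r["size"] for r in rows], dtype=float)
col_vwap = float((price * size).sum() / size.sum())

assert np.isclose(row_vwap, col_vwap)  # same answer; the columnar scan is the fast path
```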

Q3. And why is this important for businesses?

Simon Garland: Orders-of-magnitude improvements in performance will open up new possibilities for “what-if”-style analytics and visualization, speeding up businesses’ pace of innovation, their awareness of real-time risks and their responsiveness to their customers.

The Internet of Things in particular is important to businesses that can now capitalize on the digitized time-series data they collect, such as data from smart meters and smart grids. In fact, I believe that this is only the beginning of the data volumes we will have to be handling in the years to come. We will be able to combine this information with valuable data that businesses have been collecting for decades.

Q4. One of the promise of Big Data for many businesses is the ability to effectively use both streaming data and the vast amounts of historical data that will accumulate over the years, as well as the data a business may already have warehoused, but never has been able to use. What are the main challenges and the opportunities here?

Simon Garland: This can seem like a challenge for teams trying to assemble a system from a streaming database from one vendor, an in-memory database from another, and a historical database from yet another. They then pull data from all of these applications into yet another programming environment. This method cannot deliver the required performance and, in the long term, is fragile and unmaintainable.

The opportunity here is a database platform, like kdb+, that unifies the software stack and is robust, easily scalable and easily maintainable.

Q5. How difficult is to combine and process streaming, in-memory and historical data in real time analytics at scale?

Simon Garland: This is an important question. These functionalities can’t be added afterwards. Kdb+ was designed for streaming data, in-memory data and historical data from the beginning. It was also designed with multi-core and multi-process support from the beginning, which is essential for processing large amounts of historical data in parallel on current hardware.

We have been doing this for decades, even before multi-core machines existed, which is why Wall Street was an early adopter of our technology.
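
A hedged sketch of that parallel pattern (not kdb+ internals): date-partitioned historical data is fanned out across CPU cores and the partial results are combined. The partition layout, file names and query below are hypothetical.

```python
# Fan a per-symbol aggregation out over daily partition files, one worker per core.
from concurrent.futures import ProcessPoolExecutor
import glob

import pandas as pd

def partial_stats(path):
    """Aggregate one daily partition: total traded size per symbol."""
    day = pd.read_csv(path)
    return day.groupby("sym")["size"].sum()

def query_history(pattern="hist/*.csv"):
    paths = sorted(glob.glob(pattern))
    if not paths:
        return pd.Series(dtype=float)
    with ProcessPoolExecutor() as pool:            # defaults to one worker per core
        partials = list(pool.map(partial_stats, paths))
    return pd.concat(partials).groupby(level=0).sum()

if __name__ == "__main__":
    print(query_history())
```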

Q6. q programming language vs. SQL: could you please explain the main differences? And also highlight the Pros and cons of each.

Simon Garland: The q programming language is built into the database system kdb+. It is an array programming language that inherently supports the concepts of vectors and column store databases rather than the rows and records that traditional SQL supports.

The main difference is that traditional SQL has no built-in concept of order, whereas q does. This makes complete sense when dealing with time-series data.

Q is intuitive and the syntax is extremely concise, which leads to more productivity, less maintenance and quicker turnaround times.
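
To illustrate what built-in order buys you, here is an order-dependent query, a per-symbol moving average over the last three ticks, sketched with pandas as a stand-in (in q this is essentially a one-liner with the built-in mavg); the data is invented and this is not q code.

```python
# A rolling (order-dependent) aggregation: only well-defined because the
# rows carry an ordering, which set-oriented SQL does not assume.
import pandas as pd

trades = pd.DataFrame({
    "time":  pd.to_datetime(["09:30:00", "09:30:01", "09:30:02", "09:30:03"]),
    "sym":   ["XYZ", "XYZ", "XYZ", "XYZ"],
    "price": [100.0, 100.5, 101.0, 100.8],
})

# Rows are already in time order, so a 3-tick moving average is well-defined.
trades["ma3"] = trades.groupby("sym")["price"].transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)
print(trades)
```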

Q7. Could you give us some examples of successful Big Data real time analytics projects you have been working on?

Simon Garland: Utility applications are using kdb+ for millisecond queries of tables with hundreds of billions of data points captured from millions of smart meters. Analytics on this data can be used for balancing power generation, managing blackouts and for billing and maintenance.

Internet companies with massive amounts of traffic are using kdb+ to analyze Googlebot behavior to learn how to modify pages to improve their ranking. They tell us that traditional databases simply won’t work when they have 100 million pages receiving hundreds of millions of hits per day.

In industries like pharmaceuticals, where decision-making is based on data that can be one day, one week or one month old, our customers and prospects say our column store database makes their legacy data warehouse software obsolete. It is many times faster on the same queries. The time needed for complex analyses on extremely large tables has literally been reduced from hours to seconds.

Q8. Are there any similarities in the way large data sets are used in different vertical markets such as financial service, energy & pharmaceuticals?

Simon Garland: The shared feature is that all of our customers have structured, time-series data. The scale of their data problems is completely different, as are their business use cases. The financial services industry, where kdb+ is an industry standard, demands constant improvements to real-time analytics.

Other industries, like pharma, telecom, oil and gas and utilities, have a different concept of time. They also often work with smaller data extracts, which they still consider “Big Data.” When data comes in one day, one week or one month after an event occurred, there is not the same sense of real-time decision making as in finance. Having faster results for complex analytics helps all industries innovate and become more responsive to their customers.

Q9. Anything else you wish to add?

Simon Garland: If we piqued your interest, we have a free, 32-bit version of kdb+ available for download on our web site.

————-
Simon Garland, Chief Strategist, Kx Systems
Simon is responsible for upholding Kx’s high standards for technical excellence and customer responsiveness. He also manages Kx’s participation in the Securities Trading Analysis Center, overseeing all third-party benchmarking.
Prior to joining Kx in 2002, Simon worked at a database search engine company.
Before that he worked at Credit Suisse in risk management. Simon has developed software using kdb+ and q, going back to when the original k and kdb were introduced. Simon received his degree in Mathematics from the University of London and is currently based in Europe.

Resources

Link to download the free 32-bit version of kdb+

Q Tips: Fast, Scalable and Maintainable Kdb+, by Nick Psaris

Related Posts

Big Data and Procurement. Interview with Shobhit Chugh. Source: ODBMS Industry Watch, Published on 2015-05-19

On Big Data and the Internet of Things. Interview with Bill Franks. Source: ODBMS Industry Watch, Published on 2015-03-09

On MarkLogic 8. Interview with Stephen Buxton. Source: ODBMS Industry Watch, Published on 2015-02-13

Follow ODBMS.org on Twitter: @odbmsorg
##
