“Topdown cataloging and masterdata management tools typically require expensive data curators, and are not simple to use. This poses a significant threat to cataloging efforts since so much knowledge about your organization’s data is inevitably clustered across the minds of the people who need to question it and the applications they use to answer those questions.”–Gideon Goldin
I have interviewed Gideon Goldin, UX Architect, Product Manager at Tamr.
Q1. What is “dark data”?
Gideon Goldin: Gartner refers to dark data as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” For most organizations, dark data comprises the majority of available data, and it is often the result of the constantly changing and unpredictable nature of enterprise data something that is likely to be exacerbated by corporate restructuring, M&A activity, and a number of external factors.
By shedding light on this data, organizations are better suited to make more datadriven, accurate business decisions.
Tamr Catalog, which is available as a free downloadable app, aims to do this, providing users with a view of their entire data landscape so they can quickly understand what was in the dark and why.
Q2. What are the main drawbacks of traditional topdown methods of cataloging or “master data management”?
Gideon Goldin: The main drawbacks are scalability and simplicity. When Yahoo, for example, started to catalog the web they employed some topdown approaches, hiring specialists to curate structured directories of information. As the web grew, however, their solution became less relevant and significantly more costly. Google, on the other hand, mined the web to understand references that exist between pages, allowing the relevance of sites to emerge from the bottomup. As a result, Google’s search engine was more accurate, easier to scale, and simpler.
Topdown cataloging and masterdata management tools typically require expensive data curators, and are not simple to use. This poses a significant threat to cataloging efforts since so much knowledge about your organization’s data is inevitably clustered across the minds of the people who need to question it and the applications they use to answer those questions. Tamr Catalog aims to deliver an innovative and vastly simplified method for cataloging your organization’s data.
Q3. Tamr recently opened a public Beta program Tamr Catalog for an enterprise metadata catalog. What is it?
Gideon Goldin: The Tamr Catalog Beta Program is an open invitation to testdrive our free cataloging software. We have yet to find an organization that is content with their current cataloging approaches, and we found that the biggest barrier to reform is often knowing where to start. Catalog can help: the goal of the Catalog Beta Program is to better understand how people want and need to collaborate around their data sources. We believe that an early partnership with the community will ensure that we develop useful functionality and thoughtful design.
Q4 What are the core functionality of Tamr Catalog?
Gideon Goldin: Tamr Catalog enables users to easily register, discover and organize their data assets.
Q5. How does it help simplify access to highquality data sets for analytics?
Gideon Goldin: Not surprisingly, people are biased to use the data sets closest to them. With Catalog, scientists and analysts can easily discover unfamiliar data setsdata sets, for example, that may belong to other departments or analysts. Catalog profiles and collects pointers to your sources, providing multifaceted and visual browsing of all data trivializing the search for any given set of data.
Q6. How does Tamr Catalog relate to the Tamr Data Unification Platform?
Gideon Goldin: Before organizations can unify their data, preparing it for improved analysis or management, they need to know what they have. Organizations often lack a good approach for this first (and repeating) step in data unification. We realized this quickly when helping large organizations begin their unification projects, and we even realized we lacked a satisfactory tool to understand our own data. Thus, we built Catalog as a part of the Tamr Data Unification Platform to illuminate your data landscape, such that people can be confident that their unification efforts are as comprehensive as possible.
Q7. What are the main challenges (technical and non technical) in achieving a broad adoption of a vendor and platform neutral metadata cataloging?
Gideon Goldin: Often the challenge isn’t about volume, it’s about variety. While a vendor neutral Catalog intends to solve exactly this, there remains a technical challenge in providing a flexible and elegant interface for cataloging dozens or hundreds of different types of data sets and the structures they comprise.
However, we find that some of the biggest (and most interesting) challenges revolve around organizational processes and culture. Some organizations have developed sophisticated but unsustainable approaches to managing their data, while others have become paralyzed by the inherently disorganized nature of their data. It can be difficult to appreciate the value of investing in these problems. Figuring out where to start, however, shouldn’t be difficult. This is why we chose to release a lightweight application free of charge.
Q8. Chief Data Officers (CDOs), data architects and business analysts have different requirements and different modes of collaborating on (shared) data sets. How do you address this in your catalog?
Gideon Goldin: The goal of cataloging isn’t cataloging, it’s helping CDOs identify business opportunities, empowering architects to improve infrastructures, enabling analysts to enrich their studies, and more. Catalog allows anyone to register and organize sources, encouraging open communication along the way.
Q9. How do you handle issues such as data protection, ownership, provenance and licensing in the Tamr catalog?
Gideon Goldin: Catalog allows users to indicate who owns what. Over the course of our Beta program, we have been fortunate enough to have over 800 early users of Catalog and have collected feedback about how our users would like to see data protection and provenance implemented in their own environments. We are eager to release new functionality to address these needs in the near future.
Q10. Do you plan to use the Tamr Catalog also for collecting data sets that can be used for data projects for the Common Good?
Gideon Goldin: We do know of a few instances of Catalog being used for such purposes, including projects that will build on the documenting of city and health data. In addition to our Catalog Beta Program, we are introducing a Community Developer Program, where we are eager to see how the community links Tamr Catalogs to new sources (including those in other catalogs), new analytics and visualizations, and ultimately insights. We believe in the power of open data at Tamr, and we’re excited to learn how we can help the Common Good.
Gideon Goldin, UX Architect, Product Manager at Tamr.
Prior to Tamr, Gideon Goldin worked as a data visualization/UX consultant and university lecturer. He holds a Masters in HCI and a PhD in cognitive science from Brown University, and is interested in designing novel humanmachine experiences. You can reach Gideon on Twitter at @gideongoldin or email him at Gideon.Goldin at tamr.com.
-Tamr Catalog Developer Community
Online community where Tamr catalog users can comment, interact directly with the development team, and learn more about the software; and where developers can explore extending the tool by creating new data connectors.
Follow ODBMs.org on Twitter: @odbmsorg
“While it’s hard to pinpoint all of the key challenges for organizations hoping to effectively deploy their own predictive models, one significant challenge we’ve observed is the lack of C-level buy in.”–John K. Thompson
Q1. What are the key challenges for organizations to effectively deploy predictive models?
John: While it’s hard to pinpoint all of the key challenges for organizations hoping to effectively deploy their own predictive models, one significant challenge we’ve observed is the lack of C-level buy in. One direct example of this was Dell’s recent internal data migration from a legacy platform to its own platform, Statistica. It required major cultural change, involving identifying key change agents among Dell’s executive and senior management teams, who were responsible for enforcing governance as needed. On a technical level, Dell Statistica contains the most sophisticated algorithms for predictive analytics, machine learning and statistical analysis, enabling companies to find meaningful patterns in data. As 44 percent of organizations still don’t understand how to extract value from their data, revealed in Dell’s Global Technology Adoption Index 2015, Dell helps businesses invest wisely in data technologies, such as Statistica, to leverage the power of predictive analytics.
Q2. What is the role of users in running data analytics?
John: End-users turn to data analytics to better understand their businesses, predict change, increase agility and control critical systems through data. Customers use Statistica for predictive modeling, visualizations, text mining and data mining. With Statistica 13’s NDA capabilities, organizations can save time and resources by allowing the analytic processing to take place in the database or Hadoop cluster, rather than pulling data to a server or desktop. With features such as these, businesses can spend more time analyzing and making decisions from their data vs. processing the information.
Q3. What are the key challenges for organizations to embed analytics across core processes?
John: Embedding analytics across an organization’s core processes helps offer analytics to more users and allows it to become more universally accepted throughout the business. One of the largest challenges of embedding analytics is the attempt to analyze unorganized datasets. This can lead to miscategorization of the data, which can eventually result in making inaccurate business decisions. At Dell’s annual conference, Dell World, on October 20, we announced new offerings and capabilities that enable companies to embed analytics across their core processes and disseminate analytics expertise to give scalability to data-based decision making.
Q4. How is analytics related to the Internet of Things?
John: Data analytics and the Internet of Things go hand in hand. In the modern data economy, the ability to gain predictive insight from all data is critical to building an agile, connected and thriving data-driven enterprise. Whether the data comes from real-time sensors from an IoT environment, or a big data platform designed for analytics on massive amounts of disparate data, our new offerings enable detailed levels of insight and action. With the new capabilities and enhancements delivered in Statistica 13, Dell is making it possible for organizations of all sizes to deploy predictive analytics across the enterprise and beyond in a smart, simple and cost-effective manner. We believe this ultimately empowers them to better understand customers, optimize business processes, and create new products and services.
Q5. On big data and analytics Dell has announced new offerings to its end-to-end big data and analytics portfolio. What are these new offerings?
John: Dell is announcing a series of new big data and analytics solutions and services designed to help companies quickly and securely turn data into insights for better, faster decision-making. Statistica 13, the newest version of our advanced analytics software, makes it easier for organizations to deploy predictive models across the enterprise to reveal business and customer insights. Dell Services’ Analytics-as-a-Service offerings target specific industries, including banking and insurance, to provide actionable information, and better understand customers and business processes. Overall, with these enhancements, Dell is making it easier for organizations to understand how to invest in big data technologies and leverage the power of predictive analytics.
Q6. Dell is not a software company. How do you help customers turn data into insights for better decision making?
John: Dell has made great strides in the software industry, and specifically, the big data and analytics space, since our 2014 acquisition of StatSoft. Both Statistica 13 and Dell’s expanded Analytics-as-a-Service offerings help customers better unearth insights, predict business outcomes, and improve accuracy and efficiency of critical business processes. For example, the new analytics-enabled Business Process Outsourcing (BPO) services help organizations deal with fraud, denial likelihood scoring and customer retention. Additionally, the Dell ModelHealth Tracker to helps customers track and monitor the effectiveness of their various predictive analytics models, leading to better business-decision making at every level.
Q7. What are the main advancements to Dell`s analytics platform that you have introduced? And why?
John: The launch of Statistica 13 helps simplify the way organizations of all sizes deploy predictive models directly to data sources inside the firewall, in the cloud and in partner ecosystems. Additionally, Statistica 13 requires no coding and integrates seamlessly with open source R, which helps organizations leverage all data to predict future trends, identify new customers and sales opportunities, explore “what-if” scenarios, and reduce the occurrence of fraud and other business risks. The full list of enhancements include:
- A modernized GUI for greater ease-of-use and visual appeal
- More integration with the recently added Statistica Interactive Visualization and Dashboard engine
- More integration with open source R allowing for more control of R scripts
- A new stepwise model tool that gradually recommends optimum models for users
- New Native Distributed Analytics (NDA) capabilities that allow users to run analytics directly in the database where data lives and work more efficiently with large and growing data sets
Q8. Why did you introduce a new package of analytics-as-a-service offerings for industry verticals?
John: We’re announcing new analytics-as-a-service offerings in the healthcare and financial industries as those are two areas in which we’re seeing not only extreme growth, but an increased willingness and appetite for leveraging predictive analytics. These new services include:
- Fraud, Waste and Abuse Management:Allows businesses to better identify medical identity theft, unnecessary diagnostic services or medically unnecessary services and incorrect billing.
- Denial Likelihood Scoring and Predictive Analytics:Allows business to proactively identify which claims are most likely to be denied while providing at-a-glance activity data on each account. This can help eliminate up to 40 percent of low- or no-value follow-up work.
- Churn Management/Customer Retention Services:Allows businesses to leverage predictive churn modelling. This helps users identify customers they are at risk of losing and proactively take preventative measures.
Q9. Dell has launched a new purpose-built IoT gateway series with analytics capabilities. What is it and what is it useful for?
John: The new Dell Edge Gateway 5000 Series is a solution designed purpose-built for Industrial IoT. Combined with Statistica, the solution promises to give companies an edge computing solution alternative to today’s costly and proprietary IoT offerings. Thanks to new capabilities in Statistica 13, Dell is now expanding analytics to the gateway, allowing companies to extend the benefits of cloud computing to their network edge. In turn, this allows for more secure business insights, and saves companies the costly transfer of data to and to and from the cloud.
Q10. Anything else you wish to add?
John: If you’d like to hear more about what’s coming from Dell Software at Dell World 2015, check our Twitter feed at @DellSoftware for real-time updates.
John K. Thompson
John K. Thompson is the general manager of global advanced analytics at Dell Software. John has 25 years of experience in building and growing technology companies in the information management segment. He has developed and executed plans for overall sales and marketing, product development and market entry. His focus areas are big data, descriptive & predictive analytics, cognitive computing, and data mining. John holds a BS in Computer Science from Ferris State University and a MBA in Marketing from DePaul University.
– Dell Study Reveals Companies Investing in Cloud, Mobility, Security and Big Data Are Growing More Than 50 Percent Faster Than Laggards, Dell Press release, 13 Oct 2015.
Follow ODBMS.org on Twitter: @odbmsorg
“The question of ‘who owns the data’ will undoubtedly add requirements on the underlying service architecture and database, such as the ability to add meta-data relationships representing the provenance or ownership of specific device data.”–Steve Cellini
I have interviewed Steve Cellini, Vice President of Product Management at NuoDB. We covered the challenges and opportunities of The Internet of Things, seen from the perspective of a database vendor.
Q1. What are in your opinion the main Challenges and Opportunities of The Internet of Things (IoT) seen from the perspective of a database vendor?
Steve Cellini: Great question. With the popularity of Internet of Things, companies have to deal with various requirements, including data confidentiality and authentication, access control within the IoT network, privacy and trust among users and devices, and the enforcement of security and privacy policies. Traditional security counter-measures cannot be directly applied to IoT technologies due to the different standards and communication stacks involved. Moreover, the high number of interconnected devices leads to scalability issues; therefore a flexible infrastructure is needed to be able to deal with security threats in such a dynamic environment.
If you think about IoT from a data perspective, you’d see these characteristics:
• Distributed: lots of data sources, and consumers of workloads over that data are cross-country, cross-region and worldwide.
• Dynamic: data sources come and go, data rates may fluctuate as sets of data are added, dropped or moved into a locality. Workloads may also fluctuate.
• Diverse: data arrives from different kinds of sources
• Immediate: some workloads, such as monitoring, alerting, exception handling require near-real-time access to data for analytics. Especially if you want to spot trends before they become problems, or identify outliers by comparison to current norms or for a real-time dashboard.
These issues represent opportunities for the next generation of databases. For instance, the need for immediacy turns into a strong HTAP (Hybrid Transactional and Analytic Processing) requirement to support that as well as the real-time consumption of the raw data from all the devices.
Q2. Among the key challenge areas for IoT are Security, Trust and Privacy. What is your take on this?
Steve Cellini: IoT scenarios often involve human activities, such as tracking utility usage in a home or recording motion received from security cameras. The data from a single device may be by itself innocuous, but when the data from a variety of devices is combined and integrated, the result may be a fairly complete and revealing view of one’s activities, and may not be anonymous.
With this in mind, the associated data can be thought of as “valuable” or “sensitive” data, with attendant requirements on the underlying database, not dissimilar from, say, the kinds of protections you’d apply to financial data — such as authentication, authorization, logging or encryption.
Additionally, data sovereignty or residency regulations may also require that IoT data for a given class of users be stored in a specific region only, even as workloads that consume that data might be located elsewhere, or may in fact roam in other regions.
There may be other requirements, such as the need to be able to track and audit intermediate handlers of the data, including IoT hubs or gateways, given the increasing trend to closely integrate a device with a specific cloud service provider, which intermediates general access to the device. Also, the question of ‘who owns the data’ will undoubtedly add requirements on the underlying service architecture and database, such as the ability to add meta-data relationships representing the provenance or ownership of specific device data.
Q3. What are the main technical challenges to keep in mind while selecting a database for today’s mobile environment?
Steve Cellini: Mobile users represent sources of data and transactions that move around, imposing additional requirements on the underlying service architecture. One obvious requirement is to enable low-latency access to a fully active, consistent, and up-to-date view of the database, for both mobile apps and their users, and for backend workloads, regardless of where users happen to be located. These two goals may conflict if the underlying database system is locked to a single region, or if it’s replicated and does not support write access in all regions.
It can also get interesting when you take into account the growing body of data sovereignty or residency regulations. Even as your users are traveling globally, how do you ensure that their data-at-rest is being stored in only their home region?
If you can’t achieve these goals without a lot of special-case coding in the application, you are going to have a very complex, error-prone application and service architecture.
Q4. You define NuoDB as a scale-out SQL database for global operations. Could you elaborate on the key features of NuoDB?
Steve Cellini: NuoDB offers several key value propositions to customers: the ability to geo-distribute a single logical database across multiple data centers or regions, arbitrary levels of continuous availability and storage redundancy, elastic horizontal scale out/in on commodity hardware, automation, ease and efficiency of multi-tenancy.
All of these capabilities enable operations to cope flexibly, efficiently and economically as the workload rises and dips around the business lifecycle, or expands with new business requirements.
Q5. What are the typical customer demands that you are responding to?
Steve Cellini: NuoDB is the database for today’s on-demand economy. Businesses have to respond to their customers who demand immediate response and expect a consistent view of their data, whether it be their bank account or e-commerce apps — no matter where they are located. Therefore, businesses are looking to move their key applications to the cloud and ensure data consistency – and that’s what is driving the demand for our geo-distributed SQL database.
Q6. Who needs a geo-distributed database? Could you give some example of relevant use cases?
Steve Cellini: A lot of our customers come to us precisely for our geo distributed capability – by which I mean our ability to run a single unified database spread across multiple locations, accessible for querying and updating equally in all those locations. This is important where applications have mobile users, switching the location they connect to. That happens a lot in the telecommuting industry. Or they’re operating ‘follow the sun’ services where a user might need to access any data from anywhere that’s a pattern with global financial services customers. Or just so they can offer the same low-latency service everywhere. That’s what we call “local everywhere”, which means you don’t see increasing delays, if you are traveling further from the central database.
Q7. You performed recently some tests using the DBT2 Benchmark. Why are you using the DBT2 Benchmark and what are the results you obtained so far?
Steve Cellini: The DBT2 (TPC/C) benchmark is a good test for an operational database, because it simulates a real-world transactional workload.
Our focus on DBT2 hasn’t been on achieving a new record for absolute NOTPM rates, but rather to explore one of our core value propositions — horizontal scale out on commodity hardware. We recently passed the 1 million NOTPM mark on a cluster of 50 low-cost machines and we are very excited about it.
Q8. How is your offering in the area of automation, resiliency, and disaster recovery different (or comparable) with some of the other database competitors?
Steve Cellini: We’ve heard from customers who need to move beyond the complexity, pain and cost of their disaster recovery operations, such as expanding from a typical two data center replication operation to three or more data centers, or addressing lags in updates to the replica, or moving to active/active.
With NuoDB, you use our automation capability to dynamically expand the number of hosts and regions a database operates in, without any interruption of service. You can dial in the level of compute and storage redundancy required and there is no single point of failure in a production NuoDB configuration. And you can update in every location – which may be more than two, if that’s what you need.
Steve Cellini VP, Product Management, NuoDB
Steve joined NuoDB in 2014 and is responsible for Product Management and Product Support, as well as helping with strategic partnerships.
In his 30-year career, he has led software and services programs at various companies – from startups to Fortune 500 – focusing on bringing transformational technology to market. Steve started his career building simulator and user interface systems for electrical and mechanical CAD products and currently holds six patents.
Prior to NuoDB, Steve held senior technical and management positions on cloud, database, and storage projects at EMC, Mozy, and Microsoft. At Microsoft, Steve helped launch one of the first cloud platform services and led a company-wide technical evangelism team. Steve has also built and launched several connected mobile apps. He also managed Services and Engineering groups at two of the first object database companies – Ontos (Ontologic) and Object Design.
Steve holds a Sc.B in Engineering Physics from Cornell University.
Follow ODBMS.org on Twitter: @odbmsorg
“Traditional OLAP tools run into problems when trying to deal with massive data sets and high cardinality.”–Ajay Anand
I have interviewed Ajay Anand, VP Product Management and Marketing, Kyvos Insights. Main topic of the interview is big data analytics.
Q1. In your opinion, what are the current main challenges in obtaining relevant insights from corporate data, both structured and unstructured, regardless of size and granularity?
Ajay Anand: We focus on making big data accessible to the business user, so he/she can explore it and decide what’s relevant. One of the big inhibitors to the adoption of Hadoop is that it is a complex environment and daunting for a business user to work with. Our customers are looking for self-service analytics on data, regardless of the size or granularity. A business user should be able to explore the data without having to write code, look at different aspects of the data, and follow a train of thought to answer a business question, with instant, interactive response times.
Q2. What is your opinion about using SQL on Hadoop?
Ajay Anand: SQL is not the most efficient or intuitive way to explore your data on Hadoop. While Hive, Impala and others have made SQL queries more efficient, it can still take tens of minutes to get a response when you are combining multiple data sets and dealing with billions of rows.
Q3. Kyvos Insights emerged a couple of months ago from Stealth mode. What is your mission?
Ajay Anand: Our mission is to make big data analytics simple, interactive, enjoyable, massively scalable and affordable. It should not be just the domain of the data scientist. A business user should be able to tap into the wealth of information and use it to make better business decisions or wait for reports to be generated.
Q4. There are many diverse tools for big data analytics available today. How do you position your new company in the already quite full market for big data analytics?
Ajay Anand: While there are a number of big data analytics solutions available in the market, most customers we have talked to still had significant pain points. For example, a number of them are Tableau and Excel users. But when they try to connect these tools to large data sets on Hadoop, there is a significant performance impact. We eliminate that performance bottleneck, so that users can continue to use their visualization tool of choice, but now with response time in seconds.
Q5. You offer “cubes on Hadoop.” Could you please explain what are such cubes and what are the useful for?
Ajay Anand: OLAP cubes are not a new concept. In most enterprises, OLAP tools are the preferred way to do fast, interactive analytics.
However, traditional OLAP tools run into problems when trying to deal with massive data sets and high cardinality.
That is where Kyvos comes in. With our “cubes on Hadoop” technology, we can build linearly scalable, multi-dimensional OLAP cubes and store them in a distributed manner on multiple servers in the Hadoop cluster. We have built cubes with hundreds of billions of rows, including dimensions with over 300 million cardinality. Think of a cube where you can include every person in the U.S., and drill down to the granularity of an individual. Once the cube is built, now you can query it with instant response time, either from our front end or from traditional tools such as Excel, Tableau and others.
Q6. How do you convert raw data into insights?
Ajay Anand: We can deal with all kinds of data that has been loaded on Hadoop. Users can browse this data, look at different data sets, combine them and process them with a simple drag and drop interface, with no coding required. They can specify the dimensions and measures they are interested in exploring, and we create Hadoop jobs to process the data and build cubes. Now they can interactively explore the data and get the business insights they are looking for.
Q7. A good analytical process can result in poor results if the data is bad. How do you ensure the quality of data?
Ajay Anand: We provide a simple interface to view your data on Hadoop, decide the rules for dropping bad data, set filters to process the data, combine it with lookup tables and do ETL processing to ensure that the data fits within your parameters of quality. All of this is done without having to write code or SQL queries on Hadoop.
Q8. How do you ensure that the insights you obtained with your tool are relevant?
Ajay Anand: The relevance of the insights really depends on your use case. Hadoop is a flexible and cost-effective environment, so you are not bound by the constraints of an expensive data warehouse where any change is strictly controlled. Here you have the flexibility to change your view, bring in different dimensions and measures and build cubes as you see fit to get the insights you need.
Q9. Why do technical and/or business users want to develop multi-dimensional data models from big data, work with those models interactively in Hadoop, and use slice-and-dice methods? Could you give us some concrete examples?
Ajay Anand: An example of a customer that is using us in production to get insights on customer behavior for marketing campaigns is a media and entertainment company addressing the Latino market. Before using big data, they used to rely on surveys and customer diaries to track viewing behavior. Now they can analyze empirical viewing data from more than 20 million customers, combine it with demographic information, transactional information, geographic information and many other dimensions. Once all of this data has been built into the cube, they can look at different aspects of their customer base with instant response times, and their advertisers can use this to focus marketing campaigns in a much more efficient and targeted manner, and measure the ROI.
Q10. Could you share with us some performance numbers for Kyvos Insights?
Ajay Anand: We are constantly testing our product with increasing data volumes (over 50 TB in one use case) and high cardinality. One telecommunications customer is testing with subscriber information that is expected to grow to several trillion rows of data. We are also testing with industry standard benchmarks such as TPC-DS and the Star Schema Benchmark. We find that we are getting response times of under two seconds for queries where Impala and Hive take multiple minutes.
Q11. Anything else you wish to add?
Ajay Anand: As big data adoption enters the mainstream, we are finding that customers are demanding that analytics in this environment be simple, responsive and interactive. It must be usable by a business person who is looking for insights to aid his/her decisions without having to wait for hours for a report to run, or be dependent on an expert who can write map-reduce jobs or Hive queries. We are moving to a truly democratized environment for big data analytics, and that’s where we have focused our efforts with Kyvos.
Ajay Anand is vice president of products and marketing at Kyvos Insights, delivering multi-dimensional OLAP solutions that run natively on Hadoop. Ajay has more than 20 years of experience in marketing, product management and development in the areas of big data analytics, storage and high availability clustered systems.
Prior to Kyvos Insights, he was founder and vice president of products at Datameer, delivering the first commercial analytics product on Hadoop. Before that he was director of product management at Yahoo, driving adoption of the Hadoop based data analytics infrastructure across all Yahoo properties. Previously, Ajay was director of product management and marketing for SGI’s Storage Division. Ajay has also held a number of marketing and product management roles at Sun, managing teams and products in the areas of high availability clustered systems, systems management and middleware.
Ajay earned an M.B.A. and an M.S. in computer engineering from the University of Texas at Austin, and a BSEE from the Indian Institute of Technology.
Follow ODBMS.org on Twitter: @odbmsorg
“Many companies don’t use analytical fraud detection techniques yet. In fact, most still rely on an expert based approach, meaning that they build upon the experience, intuition and business knowledge of the fraud analyst.” –Bart Baesens
Q1. What is exactly Fraud Analytics?
Good question! First of all, in our book we define fraud as an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types of forms. The idea of using analytics for fraud detection is catalyzed by the enormous amount of data which is currently being generated in any business process. Think about insurance claim handling, credit card transactions, cash transfers, tax payments, etc. to name a few. In our book, we discuss various ways of analyzing these massive data sets in a descriptive, predictive or social network way to come up with new analytical fraud detection models.
Q2. What are the main challenges in Fraud Analytics?
The definition we gave above highlights the 5 key challenges in fraud analytics. The first one concerns the fact that fraud is uncommon. Independent of the exact setting or application, only a minority of the involved population of cases typically concerns fraud, of which furthermore only a limited number will be known to concern fraud. This seriously complicates the estimation of analytical models.
Fraudsters try to blend into the environment and not behave different from others in order not to get noticed and to remain covered by non-fraudsters. This effectively makes fraud imperceptibly concealed, since fraudsters do succeed in hiding by well considering and planning how to precisely commit fraud.
Fraud detection systems improve and learn by example. Therefore the techniques and tricks fraudsters adopt evolve in time along with, or better ahead of fraud detection mechanisms. This cat and mouse play between fraudsters and fraud fighters may seem to be an endless game, yet there is no alternative solution so far. By adopting and developing advanced analytical fraud detection and prevention mechanisms, organizations do manage to reduce losses due to fraud since fraudsters, like other criminals, tend to look for the easy way and will look for other, easier opportunities.
Fraud is typically a carefully organized crime, meaning that fraudsters often do not operate independently, have allies, and may induce copycats. Moreover, several fraud types such as money laundering and carousel fraud involve complex structures that are set up in order to commit fraud in an organized manner. This makes fraud not to be an isolated event, and as such in order to detect fraud the context (e.g., the social network of fraudsters) should be taken into account. This is also extensively discussed in our book.
A final element in the description of fraud provided in our book indicates the many different types of forms in which fraud occurs. This both refers to the wide set of techniques and approaches used by fraudsters as well as to the many different settings in which fraud occurs or economic activities that are susceptible to fraud.
Q3. What is the current state of the art in ensuring early detection in order to mitigate fraud damage?
Many companies don’t use analytical fraud detection techniques yet. In fact, most still rely on an expert based approach, meaning that they build upon the experience, intuition and business knowledge of the fraud analyst. Such an expert-based approach typically involves a manual investigation of a suspicious case, which may have been signaled for instance by a customer complaining of being charged for transactions he did not do. Such a disputed transaction may indicate a new fraud mechanism to have been discovered or developed by fraudsters, and therefore requires a detailed investigation for the organization to understand and subsequently address the new mechanism.
Comprehension of the fraud mechanism or pattern allows extending the fraud detection and prevention mechanism which is often implemented as a rule base or engine, meaning in the form of a set of IF-THEN rules, by adding rules that describe the newly detected fraud mechanism. These rules, together with rules describing previously detected fraud patterns, are applied to future cases or transactions and trigger an alert or signal when fraud is or may be committed by use of this mechanism. A simple, yet possibly very effective example of a fraud detection rule in an insurance claim fraud setting goes as follows:
- Amount of claim is above threshold OR
- Severe accident, but no police report OR
- Severe injury, but no doctor report OR
- Claimant has multiple versions of the accident OR
- Multiple receipts submitted
- Flag claim as suspicious AND
- Alert fraud investigation officer
Such an expert approach suffers from a number of disadvantages. Rule bases or engines are typically expensive to build, since requiring advanced manual input by the fraud experts, and often turn out to be difficult to maintain and manage. Rules have to be kept up to date and only or mostly trigger real fraudulent cases, since every signaled case requires human follow-up and investigation. Therefore the main challenge concerns keeping the rule base lean and effective, in other words deciding upon when and which rules to add, remove, update, or merge.
By using data-driven analytical models such as descriptive, predictive or social network analytics in a complimentary way, we can improve the performance of our fraud detection approaches in terms of precision, cost efficiency and operational effectiveness.
Q4. Is early detection all that can be done? Are there any other advanced techniques that can be used?
You can do more than just detection. More specifically, two components that are essential parts of almost any effective strategy to fight fraud concern fraud detection and fraud prevention. Fraud detection refers to the ability to recognize or discover fraudulent activities, whereas fraud prevention refers to measures that can be taken aiming to avoid or reduce fraud. The difference between both is clear-cut, the former is an ex post approach whereas the latter an ex ante approach. Both tools may and likely should be used in a complementary manner to pursue the shared objective, being fraud reduction. However, as also discussed in our book, preventive actions will change fraud strategies and consequently impact detection power. Installing a detection system will cause fraudsters to adapt and change their behavior, and so the detection system itself will impair eventually its own detection power. So although complementary, fraud detection and prevention are not independent and therefore should be aligned and considered a whole.
Q5. How do you examine fraud patterns in historical data?
You can examine it in two possible ways: descriptive or predictive. Descriptive analytics or unsupervised learning aims at finding unusual anomalous behavior deviating from the average behavior or norm. This norm can be defined in various ways. It can be defined as the behavior of the average customer at a snapshot in time, or as the average behavior of a given customer across a particular time period, or as a combination of both. Predictive analytics or supervised learning assumes the availability of a historical data set with known fraudulent transactions. The analytical models built can thus only detect fraud patterns as they occurred in the past. Consequently, it will be impossible to detect previously unknown fraud. Predictive analytics can however also be useful to help explain the anomalies found by descriptive analytics.
Q6. How do you typically utilize labeled, unlabeled, and networked data for fraud detection?
Labeled observations or transactions can be analyzed using predictive analytics. Popular techniques here are linear/logistic regression, neural networks and ensemble methods such as random forests. These techniques can be used to predict both fraud incidence, which is a classification problem, as well as fraud intensity, which is a classical regression problem. Unlabeled data can be investigated using descriptive analytics. As said, the aim here is to detect anomalies deviating from the norm. Popular techniques here are: break point analysis, peer group analysis, association rules and clustering. Networked data can be analyzed using social network techniques. We found those to be very useful in our research. Popular techniques here are community detection and featurization. In our research, we developed GOTCHA!, a supervised social network learner for fraud detection. This is also extensively discussed in our book.
Q6. Fraud techniques change over time. How do you handle this?
Good point! A key challenge concerns the dynamic nature of fraud. Fraudsters try to constantly out beat detection and prevention systems by developing new strategies and methods. Therefore adaptive analytical models and detection and prevention systems are required, in order to detect and resolve fraud as soon as possible. Detecting fraud as early as possible is crucial. Hence, we also discuss how to continuously backtest analytical fraud detection models. The key idea here is to verify whether the fraud model still performs satisfactory. Changing fraud tactics creates concept drift implying that the relationship between the target fraud indicator and the data available changes on an on-going basis. Hence, it is important to closely follow-up the performance of the analytical model such that concept drift and any related performance deviation can be detected in a timely way. Depending upon the type of model and its purpose (e.g. descriptive or predictive), various backtesting activities can be undertaken. Examples are backtesting data stability, model stability and model calibration.
Q7. What are the synergies between Fraud Analytics and CyberSecurity?
Fraud analytics creates both opportunities as well as threats for cybersecurity. Think about intrusion detection as an example Predictive methods can be adopted to study known intrusion patterns, whereas descriptive methods or anomaly detection can identify emerging cyber threats. The emergence of the Internet of Things (IoT) will certainly exacerbate the importance of fraud analytics for cybersecurity. Some examples of new fraud treats are:
- Fraudsters might force access to web configurable devices (e.g. Automated Teller Machines (ATMs)) and set up fraudulent transactions;
- Device hacking whereby fraudsters change operational parameters of connected devices (e.g. smart meters are manipulated to make them under register actual usage);
- Denial of Service (DoS) attacks whereby fraudsters massively attack a connected device to stop it from functioning;
- Data breach whereby a user’s log in information is obtained in a malicious way resulting into identity theft;
- Gadget fraud also referred to as gadget lust whereby fraudsters file fraudulent claims to either obtain a new gadget or free upgrade;
- Cyber espionage whereby exchanged data is eavesdropped by an intelligence agency or used by a company for commercial purposes.
More than ever before, fraud will be dynamic and continuously changing in an IoT context. From an analytical perspective, this implies that predictive techniques will continuously lag behind since they are based on a historical data set with known fraud patterns. Hence, as soon as the predictive model has been estimated, it will become outdated even before it has been put into production. Descriptive methods such as anomaly detection, peer group and break point analysis will gain in importance. These methods should be capable of analyzing evolving data streams and perform incremental learning to deal with concept drift. To facilitate (near) real-time fraud detection, the data and algorithms should be processed in-memory instead of relying on slow secondary storage. Furthermore, based upon the results of these analytical models, it should be possible to take fully automated actions such as the shutdown of a smart meter or ATM.
Qx Anything else you wish to add?
We are happy to refer to our book for more information. We also value your opinion and look forward to receiving any feedback (both positive and negative)!
Professor Bart Baesens is a professor at KU Leuven (Belgium), and a lecturer at the University of Southampton (United Kingdom). He has done extensive research on big data & analytics, customer relationship management, web analytics, fraud detection, and credit risk management. His findings have been published in well-known international journals and presented at international top conferences. He is also author of the books Analytics in a Big Data World (goo.gl/k3kBrB), and Fraud Analytics using Descriptive, Predictive and Social Network Techniques (http://goo.gl/nlCjUr). His research is summarised at www.dataminingapps.com. He is also teaching the E-learning course, Advanced Analytics in a Big Data World, see http://goo.gl/WibNPF. He also regularly tutors, advises and provides consulting support to international firms with respect to their analytics and credit risk management strategy.
–Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection (Wiley and SAS Business Series). Authors: Bart Baesens ,Veronique Van Vlasselaer,Wouter Verbeke.
Series: Wiley and SAS Business Series, Hardcover: 400 pages. Publisher: Wiley; 1 edition, September 2015. ISBN-10: 1119133122
–Fraud Analytics:Using Supervised, Unsupervised and Social Network Learning Techniques. Authors: Bart Baesens, Véronique Van Vlasselaer, Wouter Verbeke
Publisher: Wiley 256 pages
ISBN-13: 978-1119133124 | ISBN-10: 1119133122
– Critical Success Factors for Analytical Models: Some Recent Research Insights. Bart Baesens, ODBMS.org,
27 APR, 2015
– Analytics in a Big Data World: The Essential Guide to Data Science and its Applications. Bart Baesens, ODBMS.org, 30 APR, 2014
–The threat from AI is real, but everyone has it wrong, Robert Munro, CEO Idibon. ODBMS.org
Follow ODBMS.org on Twitter: @odbmsorg
“The best way to define Big Data ROI is to look at how our customers define it and benefit from Hadoop.
Wellcare has been able to improve its query speeds from 30 days to just 7 days. This acceleration enabled the Company to increase its analytics and operational reporting by 73%.”–Lawrence Schwartz
Q1. What are the common challenges that enterprises face when trying to use Hadoop?
Lawrence Schwartz: The advent of Hadoop and Big Data has significantly changed the way organizations handle data. There’s a need now for new skills, new organizational processes, new strategies and technologies to adapt to the new playing field. It’s a change that permeates everywhere from how you touch the data, to how much you can support resource-wise and architecturally, to how you manage it and use it to stay competitive. Hadoop itself presents two primary challenges. First, the data has to come from somewhere. Enterprises must efficiently load high volumes of widely-varied data in a timely fashion. We can help with software that enables automated bulk loading into Hadoop without manual coding, and change data capture for efficient updates. The second challenge is finding engineers and Data Scientists with the right skills to exploit Hadoop. Talent is scarce in this area.
Q2. Could you give us some examples of how your customers use Hadoop for their businesses?
Lawrence Schwartz: We have an interesting range of customers using Hadoop, so I’ll provide three examples. One major cable provider we are working with uses Hadoop as a data lake. They are integrating feeds from 200 data stores into Pivotal HD. This data lake includes fresh enterprise data – fed in real-time, not just as an archival area – to run up-to-date reporting and analytics without hitting key transactional systems. This enables them to improve decision support and gain competitive advantage.
Another example of how our customers are using Hadoop highlights a Fortune 50 high technology manufacturer. This customer’s business analytics requirements were growing exponentially, straining IT resources, systems and budgets.
The company selected Attunity Visibility to help it better understand its enterprise-wide data usage analytics across its various data platforms.
Having this capability enables the company to optimize business performance and maximize its investment in its Hadoop, data warehouse and business analytics systems. Attunity Visibility has helped to improve the customer’s system throughput by 25% enabling them to onboard new analytic applications without increasing investment in data warehouse infrastructure.
The third example is a financial services institution. This customer has many different data sources, including Hadoop, and one of its key initiatives is to streamline and optimize fraud detection. Using a historical analysis component, the organization would monitor real-time activity against historical trends to detect any suspicious activity. For example, if you go to a grocery store outside of your normal home ZIP code one day and pay for your goods with a credit card, this could trigger an alert at your bank. The bank would then see that you historically did not use your credit card at that retailer, prompting them to put a hold on your card, but potentially preventing a thief from using your card unlawfully. Using Attunity to leverage both historical and real-time transactions in its analytics, this company is able to decrease fraud and improve customer satisfaction.
Q3. How difficult is it to perform deep insight into data usage patterns?
Lawrence Schwartz: Historically, enterprises just haven’t had the tools to efficiently understand how datasets and data warehouse infrastructure are being used. We provide Visibility software that uniquely enables organizations to understand how tables and other Data Warehouse components are being used by business lines, departments, organizations etc. It continuously collects, stores, and analyzes all queries and applications against data warehouses. They are then correlated with data usage and workload performance metrics in a centralized repository that provides detailed usage and performance metrics for the entire data warehouse. With this insight, organizations can place the right data on the right platform at the right time. This can reduce the cost and complexity of managing multiple platforms.
Q4. Do you believe that moving data across platforms is a feasible alternative for Big Data?
Lawrence Schwartz: It really has to be, because nearly every enterprise has more than one platform, even before Hadoop is considered in the mix. Having multiple types of platforms also yields the benefits and challenges of trying to tier data based on its value, between data warehouses, Hadoop, and cloud offerings. Our customers rely on Attunity to help them with this challenge every day. Moving heterogeneous data in many different formats, and from many different sources is challenging when you don’t have the right tools or resources at your disposal. The problem gets magnified when you’re under the gun to meet real-time SLAs. In order to be able to do all of that well, you need to have a way to understand what data to move, and how to move the data easily, seamlessly and in a timely manner. Our solutions make the whole process of data management and movement automated and seamless, and that’s our hallmark.
Q5. What is “Application Release Automation” and why is it important for enterprises?
Lawrence Schwartz: Application release automation (ARA) solutions are a proven way to support Agile development, accelerate release cycles, and standardize deployment processes across all tiers of the application and content lifecycles. ARA solutions can be used to support a wide variety of activities, ranging from publishing and modifying web site content to deploying web-based tools, distributing software to business end users, and moving code between Development, Test, and Production environments.
Attunity addresses this market with an automation platform for enterprise server, web operations, shared hosting, and data center operations teams. Attunity ARA solutions are designed to offload critical, time-consuming deployment processes in complex enterprise IT environments. Enterprises that adopt ARA solutions enjoy greater business flexibility, improved productivity, better cross-team collaboration, and improved consistency.
Q6. What is your relationships with other Hadoop vendors?
Lawrence Schwartz : Attunity has great working partnerships with all of the major Hadoop platform vendors, including Cloudera, Hortonworks, Pivotal and MapR. We have terrific synergy and work together towards a common goal – to help our customers meet the demands of a growing data infrastructure, optimize their Big Data environments, and make onboarding to Hadoop as easy as possible. Our solutions are certified with each of these vendors, so customers feel confident knowing that they can rely on us to deliver a complete and seamless joint solution for Hadoop.
Q7. Attunity recently acquired Appfluent Technology, Inc. and BIReady. Why Appfluent Technology? Why BIReady? How do these acquisitions fit into Attunity`s overall strategy?
Lawrence Schwartz: When we talk with enterprises today, we hear about how they are struggling to manage mountains of growing data and looking for ways to make complex processes easier. We develop software and acquire companies that help our customers streamline and optimize existing systems as well as scale to meet the growing demands of business.
Appfluent brings the Visibility software I described earlier. With Visibility, companies can rebalance data to improve performance and cost in high-scale, rapidly growing environments. They also can meet charge-back, show-back and audit requirements.
BIReady, now known as Attunity Compose, helps enterprises build and update data warehouses more easily. Data warehouse creation and administration is among the most labor-intensive and time-consuming aspects of analytics preparation. Attunity Compose overcomes the complexity with automation, using significantly less resources. It automatically designs, generates and populates enterprise data warehouses and data marts, adding data modeling and structuring capabilities inside the data warehouse.
Q8. How do you define Big Data ROI?
Lawrence Schwartz: The best way to define this is to look at how our customers define it and benefit from Hadoop.
One of our Fortune 500 customers is Wellcare, which provides managed care services to government-sponsored healthcare programs like Medicaid and Medicare. Wellcare plans to use our software to load data from its Pivotal data warehouse into Hadoop, where they will do much of their data processing and transformations. They will then move a subset of that data from Hadoop back into Pivotal and run their analytics from there. So in this case Hadoop is a staging area. As a result of implementing the first half of this solution (moving data from various databases into Pivotal), Wellcare has been able to improve its query speeds from 30 days to just 7 days. This acceleration enabled the Company to increase its analytics and operational reporting by 73%. At the same time, the solution helps Wellcare meet regulatory requirements in a timely manner more easily, ensuring that it receives the state and federal funding required to run efficiently and productively.
In another example, one of our customers, a leading online travel services company, was dealing with exploding data volumes, escalating costs and an insatiable appetite for business analytics. They selected Attunity Visibility to reduce costs and improve information agility by offloading data and workload from their legacy data warehouse systems to a Hadoop Big Data platform. Attunity Visibility has saved the company over $6 million in two years by ensuring that the right workload and data are stored and processed on the most cost-effective platform based on usage.
Follow ODBMS.org on Twitter: @odbmsorg
“Our evaluation of IMDSs determined that eXtremeDB-64 IMDS outperformed other IMDSs in terms of performance and scalability.”–Manjul Maharishi.
I have interviewed Manjul Maharishi, Vice President (telecom software development) at Transaction Network Services.
They use In-Memory Database technology for managing real-time community networks in the world.
Q1. What is the mission of Transaction Network Services (TNS)?
Manjul Maharishi: Transaction Network Services manages many of the largest real-time community networks in the world, enabling industry participants to simply and securely interact and transact with other businesses, to access the data and applications they need, over managed and secure communications platforms. TNS’ existing footprint supports millions of connections and access to critical databases, enabling its customers through a single connection, a “one-to-many and many-to-many” global platform, securely blending private and public networking.
Q2. What is TNS’s Carrier ENUM Registry? And for what is it useful for?
Manjul Maharishi: Carrier ENUM Registry is a product offering for telecom carriers that provides information critical to the accurate routing and billing of inter-carrier communications, such as voice and mobile data services.
Carrier ENUM Registry addresses a challenge that is posed every time you place a phone call or send a text message: how, in the split second of latency that is deemed acceptable, will the call or message find the way to its recipient?
As a solution, Carrier ENUM Registry makes available an up-to-date, portability-corrected image of the entire public dial plan as well as authoritative information sourced directly from the service provider that “owns” (in telecom parlance, has the “right-to-use”) a particular telephone number. This is provided in the form of two registries, or databases:
• Number Identity Registry is a massive repository of global telephone numbers and carrier-of-record information that identifies which service provider a telephone number was allocated to for end-user assignment. In response to lookups the registry returns a Carrier Identifier (which can be in the form of a Service Provider Identifier (SPID), and/or a Mobile Country Code+Mobile Network Code(MCC+MNC)) and when available, the Location Routing Number (LRN) of ported and pooled numbers.
• Network Routing Directory is a multi-party shared registration system that furnishes service providers with sophisticated data-sharing capabilities featuring safeguard controls designed to uphold data-sharing policies. Using our secure portal service, providers self-administer data and selectively grant access, in whole or part, to trading partners (and vice versa).
Q3. Who are the customers using Carrier ENUM Registry?
Manjul Maharishi: The customers are telecom carriers – large, small and in-between – worldwide.
They are mobile, landline and IP-based, including some ISPs (Internet Service Providers) and cable MSOs (multiple system operators) that offer phone service, as well as “pure” VoIP providers. Most query the Carrier ENUM Registry deployment hosted at TNS’ facility in the US, but some host the application and database on their own premises.
Q4. What services do you support?
Manjul Maharishi: Carrier ENUM emerged as a service to connect the public switched telephone network (PSTN) and new IP-based networks, by resolving phone numbers to IP addresses and services. It also provided a bridge between IP-based carriers. For example, with a multi-vendor database of routes, users and phone numbers available, a caller on IP-based Network A could communicate with a user of IP-based Network B without routing calls across the PSTN (which would incur costs and may require avoidable transcoding).
Over time, though, Carrier ENUM Registry has gained complexity along with new features, and does much more than bridging between carriers. Supported services now include number portability, IP-peering between telephone service providers, SMS/MMS (aka “text messages”) routing, unbundling of services (allowing messaging to be offered separately from voice, for example), customized views of data, routing based on time/destination/origination, and more. These services have added complexity to TNS’ Carrier ENUM Registry business logic and have caused its databases to grow larger and the routing logic to become more complex.
Carriers can pick and choose from the various Carrier ENUM Registry features, to solve their particular challenges.
One of the biggest use cases in demand now is identifying the right carrier to terminate an SMS or MMS when number portability is involved in the host country.
Q5. What kind of real-time performance demands does Carrier ENUM Registry need to satisfy?
Manjul Maharishi: To customers, we commit to providing a response from our system within 10 milliseconds for 95% of the queries. Please note that a single customer query can result in dozens to a couple of hundred individual table queries based on the routing logic and services subscribed. However, largely through the use of in-memory database system (IMDS) technology for data management, we have been able to have a much lower variance in the query responses and a higher degree of predictability. Our typical average response to a customer query is less than 2 msec. These numbers only reflect the latency introduced by our platform, i.e. the time difference between when we receive the query and when we respond back.
The network latency – the time when the query leaves the customer network and when they receive the response – is larger (typical US cross-country network latency is 60-100 msec). An industry norm for the maximum acceptable time from when a subscriber dials digits to when they hear a ringing tone back is ~150-200 msecs, beyond which the “dead air/silence” becomes noticeable for the subscriber. However, for international calls, people do tend to be more tolerant of such post-dial delays.
Q6. Can you give an overview of the system architecture and toolset used to handle the increasingly complex business logic and growing data volume?
Manjul Maharishi: In order to handle the growing amount of stored data, we use general-purpose off-the-shelf Linux servers. This allows us to take advantage of industry-wide gains in processing power, memory and performance, as well as eliminate any dependence upon specific vendors for a software/hardware upgrade cycle. Currently, the systems are running on dual CPU, 6- and 8-core processors.
For data management we use eXtremeDB-64, the 64-bit edition of McObject’s eXtremeDB In-Memory Database System (IMDS). The system is architected such that each server stores the entire database, and customer queries are load balanced across a set of such servers. Accordingly, the platform is easily scaled by adding new servers as needed. Apart from offering this service as a cloud-based offering (“Central Replica”), we also offer the service as customer-premise deployment model (“Local Replica”) whereby the customers can gain from a much lower round-trip time (RTT) by avoiding network-latency. The TNS network operations center (NOC) monitors key performance indicators of our Central and Local Replica servers on a 24×7 basis, and we have agreements in place with our customers to scale up the platform by adding more servers if needed.
With the performance provided by eXtremeDB-64, we haven’t had a need to partition the data set in order to meet our commitments. We do use the database system’s Patricia trie indices to reduce the number of lookups required on certain tables, and work through the business logic to narrow down the search results to a manageable number early in the business rules processing.
In terms of development tools, we are developing in C++ using eXtremeDB’s C/C++ API instead of accessing via an SQL API, and this contributes to lower application latency. We develop software using Agile methodology with Continuous Integration that has nightly builds with a suite of automated tests executing during these builds. We also incorporate code coverage, leak detection and profiling as part of this Continuous Integration.
Q7. Can you tell more about how Carrier ENUM Registry meets its real-time data access requirements? Did it move to in-memory database technology recently or has this always been a feature?
Manjul Maharishi: The system architecture keeps data needed for real-time queries in memory, where it can be accessed quickly. Early versions of Carrier ENUM Registry accomplished searches using in-memory database code developed in-house for the application. However, TNS recognized several years ago that with the increasingly complex queries and higher data volumes, Carrier ENUM Registry would be better-served by an off-the-shelf in-memory database system (IMDS) that provides flexibility while scaling to hundreds of millions and even billions of records. After researching IMDSs, we chose the 64-bit eXtremeDB-64 and the new Carrier ENUM Registry version incorporating eXtremeDB-64 launched in 2013.
Currently, the system holds a master or archival data set in Oracle Enterprise DBMSs, with the data used for real-time lookups hosted “downstream” in eXtremeDB-64. Each downstream server hosts the entire data set used by the application; this data set consists of three separate (i.e. with unique schemas and data) databases with a combined size of 120 GB.
Two of the databases managed by eXtremeDB-64 on each server are “pure” in-memory databases while the third utilizes McObject’s eXtremeDB Fusion technology to include some persistent (on-disk) data storage.
Q8. Why did you choose eXtremeDB-64 from the field of available IMDSs and what has been your experience been using it?
Manjul Maharishi: Our evaluation of IMDSs determined that eXtremeDB-64 IMDS outperformed other IMDSs in terms of performance and scalability. Among other test findings, TNS determined that eXtremeDB’s performance exceeded 2 million queries per second with a 10 million-row database. When TNS upped the challenge by increasing the test database size 3000% (to 300 million records), eXtremeDB’s responsiveness fell only minimally, validating the near-linear scalability results documented in McObject’s published benchmarks. TNS’ platform for these tests consisted of Intel Xeon X5570 2.93 GHz hardware, with 8 cores and hyper-threading enabled, running Red Had Enterprise Linux 4, with 72 GB RAM.
Using eXtremeDB-64 in production with Carrier ENUM Registry has borne out our expectations: the database system meets current needs while providing room for future growth in both database size and complexity of application features.
Q9. You mentioned the use of the Patricia trie index in your database. Can you elaborate on the advantage it provides?
Manjul Maharishi: Support for the Patricia trie database index is another key eXtremeDB-64 feature (along with in-memory data storage) that enables Carrier ENUM Registry to meet its performance goals. The name of this specialized index derives from “Practical Algorithm To Retrieve Information Coded In Alphanumeric” and “reTRIEval”. Unlike the widely used B-tree index – which can also be used for finding keys with a specified prefix but can require multiple iterations to find the longest prefix match if there are multiple prefix matches in the index – the Patricia trie excels in searching for the longest prefixes of a specified value.
This approach meshes well with the unique nature of Carrier ENUM Registry’s data, and its queries. Phone numbers serve as keys for the searches that are performed when a call is placed. The key is stored on individual numbers, blocks or ranges.
A block consists of phone numbers in a quantity ranging from 1,000 to 10,000. For example, it could be the number 703667 and four additional digits ranging from 0000 to 9999. A range is a subset of a block. The key would be applied to an individual number, for example, when that number was ported from another carrier and does not fall into a large block of numbers serviced by the company using TNS’ application.
In most of the cases, it is not known beforehand if the number being queried has been “ported out” of a block or not, so the application (in the absence of the Patricia trie) would have to make multiple queries – starting with the most specific match and then dropping the least-significant digit one-by-one till a match was found. With the Patricia trie, there is only one iteration within the original query, which is much less taxing in performance terms and greatly simplifies the application logic.
Q10. Are there other aspects of your approach to data management that you’d like to mention?
Manjul Maharishi: The hybrid storage capability of eXtremeDB Fusion, mentioned above, gives us useful flexibility. eXtremeDB Fusion enables the developer to specify in-memory or persistent storage for record types within a database. Storing data on a hard disk drive (HDD) or solid state drive (SSD) has two benefits: it reduces memory demands, thereby helping us stay within servers’ maximum memory capacities (this was our primary reason for using eXtremeDB Fusion), and byte for byte, persistent storage is less expensive than memory.
We first used eXtremeDB’s hybrid storage to manage a large set of meta-data for mobile handsets such as device types, model names, dates of activation, etc. This information is used by a non-call-processing application and is looked up less frequently than Carrier ENUM Registry’s real-time routing data, so we were okay with the higher latency and variance in response that is introduced by disk-based access.
We are now expanding our use of hybrid storage to add some additional information (such as mobile device information and capabilities) to stored phone numbers, in order to enhance the communication between two subscribers – for example, by enabling features such as HD-voice, Rich Communications Services (RCS), etc. These features can result in substantially increasing the database size and memory footprint required, and eXtremeDB Fusion allows us to easily configure which portions of the data set are kept in memory and which ones are kept on persistent storage with a configurable subset cached in memory – thus allowing us to store some of the less heavily used dataset in SSDs or regular HDDs, while still maintaining the high performance required for the bulk of the transactions.
In his role as Vice President of Telecom Software Development at TNS, Manjul Maharishi is responsible for overseeing architecture, design, development and testing for all of the Products and Services offered by TNS’ Telecom Services Division. These include several massively sized Telecom Databases (serving Number Portability, Toll Free, Call Routing and Calling Name services), 3G/4G Roaming Hubs, associated Clearing and Settlement services and Data Analytics.
Prior to joining TNS, Manjul has held senior technical management positions at VeriSign and Lucent Technologies working in similar areas, including building the industry’s first widely deployed Softswitch while at Lucent Technologies.
Follow ODBMS.org on Twitter: @odbmsorg
“Cybersecurity is growing in importance with Obama, Xi and Cameron having announced major efforts to gain better control over the Internet. Dataflow computing enables computation as a bump-in-the-wire without disturbing the flow of packets. Building gateways out of DFEs will significantly support the Cybersecurity agenda for years to come.”–Oskar Mencer
I have interviewed Oskar Mencer, CEO and Founder at Maxeler Technologies. Main topic of the interview is Big Data and the Networking industry, and what Maxeler Technologies is contributing in this market.
Q1. What are data flow computers and Dataflow Engines (DFEs)? What are they typically useful for?
Oskar Mencer: Dataflow computers are highly efficient systems for computational problems with large amounts of structured and unstructured data and significant mission critical computation. DFEs are units within dataflow computers which currently hold up to 96GB of memory and provide in the order of 10K parallel operations. To put that into perspective, for some tasks, a DFE has the equivalent compute capability of a farm of several normal computers, but at the fraction of the price and energy consumption.
Q2. What is data flow analytics? and why is it important?
Oskar Mencer: Dataflow analytics is a software stack on top of Dataflow computers, providing powerful compute constructs on large datasets. Dataflow analytics is a potential answer to the challenges of Big Data and the Internet of Things.
Q3. What is a programmable data plane and how can one create secure storage with it?
Oskar Mencer: Software Defined Networking is all about the programmable control plane. Maxeler’s programmable data plane is the next step in the transformation of the Networking industry.
Q4. What are the main challenges for financial institutions who need to analyze and process massive quantities of information instantly from various sources in order to make better trading decisions?
Oskar Mencer: Today’s financial institutions have a major challenge from new legislation and requirements imposed by governments. Technology can solve some of the issues, but not all of them. On the trading side, whoever manages to process more data and derive more predictive capability from it, has a better position in the marketplace. Trading is becoming more complex and more regulated, and Maxeler’s Technology, in particular as it applies to exchanges, is starting to make a significant difference in the field, helping to push the state-of-the-art while simultaneously making finance safer.
Q5. Juniper Networks announced QFX5100-AA, a new application acceleration switch, and QFX-PFA, a new packet flow accelerator module. How do they plan to use Maxeler Technologies’ dataflow computing?
Oskar Mencer: The Application Acceleration module is based on Maxeler DFEs and programmble with Maxeler dataflow programming tools and infrastructure. The variety of networking applications this enables is tremendous, as is evident from our App gallery, which includes Apps for the Juniper switch .
Q6. What are the advantages of using a Spark/Hadoop appliance using a Juniper switch with programmable data plane?
Oskar Mencer: With a Juniper switch with a programmable dataplane, one could cache answers, compute in the network, optimize and merge maps, and generally make Spark/Hadoop deployment more scalable and more efficient.
Q7. Do you see a convergence of computer, networking and storage via Dataflow Engines (DFEs)?
Oskar Mencer: Indeed, DFEs provide efficiency at the core of networking, storage as well as compute. Dataflow computing has the potential to unify computation, the movement of data and the storage of data into a single system to solve the largest Big data analytics challenges that lie ahead.
Q8. Maxeler has been mentioned in a list of major HPC applications that had an impact on Quality of Life and Prosperity. Could you please explain what is special about this HPC application?
Oskar Mencer: Maxeler solutions provide competitive advantage and help in situations with mission critical challanges. In 2011 just after the hight of the credit crisis, Maxeler won the American Finance Technology Award with JP Morgan for applying dataflow computing to credit derivatives risk computations. Dataflow computing is a good solution for challenges where computing matters.
Q9. Big Data for the Common Good. What is your take on this?
Oskar Mencer: Big Data is a means to an end. Common good arises from bringing more predictability and stability into our lives. For example, many marriages have been saved by the availability of Satnav technology in cars, clearly a Big Data challenge. Medicine is an obvious Big Data challenge. Curing a patient is as much a Big Data challenge as fighting crime, and government in general. I see Maxeler’s dataflow computing technology as a key opportunity to address the Big Data challenges of today and tomorrow.
Qx Anything else you wish to add?
Oskar Mencer: Cybersecurity is growing in importance with Obama, Xi and Cameron having announced major efforts to gain better control over the Internet. Dataflow computing enables computation as a bump-in-the-wire without disturbing the flow of packets. Building gateways out of DFEs will significantly support the Cybersecurity agenda for years to come.
Oskar Mencer is CEO and Founder at Maxeler Technologies.
Prior to founding Maxeler, Oskar was Member of Technical Staff at the Computing Sciences Center at Bell Labs in Murray Hill, leading the effort in “Stream Computing”. He joined Bell Labs after receiving a PhD from Stanford University. Besides driving Maximum Performance Computing (MPC) at Maxeler, Oskar was Consulting Professor in Geophysics at Stanford University and he is also affiliated with the Computing Department at Imperial College London, having received two Best Paper Awards, an Imperial College Research Excellence Award in 2007 and a Special Award from Com.sult in 2012 for “revolutionising the world of computers”.
Follow ODBMS.org on Twitter: @odbmsorg
“One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration.”– John Leach
I have interviewed John Leach, CTO & Cofounder Splice Machine. Main topics of the interview are Hadoop, Big Data integration and what Splice Machine has to offer in this space. Monte Zweben, CEO of Splice Machine also contributed to the interview.
Q1. What are the Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation?
John Leach, Monte Zweben:
1. Individual record lookups. Most SQL-on-Hadoop engines are designed for full table scans in analytics, but tend to be too slow for the individual record lookups and ranges scan used by operational applications.
2. Dirty Data. Dirty data is a problem for any system, but it is compounded in Big Data, often resulting in bad reports and delays to reload an entire data set.
3. Sharding. It can be difficult to know what key to distribute data and the right shard size. This results in slow queries, especially for large joins or aggregations.
4. Hotspotting. This happens when data becomes too concentrated in a few nodes, especially for time series data. The impact is slow queries and poor parallelization.
5. SQL coverage. Limited SQL dialects will make it so you can’t run queries to meet business needs. You’ll want to make sure you do your homework. Compile the list of toughest queries and test.
6. Concurrency. Low concurrency can result in the inability to power real-time apps, handle many users, support many input sources, and deliver reports as updates happen.
7. Columnar. Not all columnar solutions are created equally. Besides columnar storage, there are many other optimizations, such as vectorization and run length encoding that can have a big impact on analytic performance. If your OLAP queries run slower, common with large joins and aggregations, this will result in poor productivity. Queries may take minutes or hours instead of seconds. On the flip-side is using columnar when you need concurrency and real-time.
8. Node Sizing. Do your homework and profile your workload. Choosing the wrong node size (e.g., CPU cores, memory) can negatively impact price/performance and create performance bottlenecks.
9. Brittle ETL on Hadoop. With many SQL-on-Hadoop solutions being unable to provide update or delete capabilities without a full data reload, this can cause a very brittle ETL that will require restarting your ETL pipeline because of errors or data quality issues. The result is a missed ETL window and delayed reports to business users.
10. Cost-Based Optimizer. A cost-based optimizer improves performance by selecting the right join strategy, the right index, and the right ordering. Some SQL-on-Hadoop engines have no cost-based optimizer or relatively immature ones that can result in poor performance and poor productivity, as well as manual tuning by DBAs.
Q2. In your experience, what are the most common problems in Big Data integration?
John Leach, Monte Zweben: Providing users access to data in a fashion they can understand and at the moment they need it, while ensuring quality and security, can be incredibly challenging.
The volume and velocity of data that businesses are churning out, along with the variety of different sources, can pose many issues.
One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration. Businesses may also find their skill sets, workload, and budgets over-stretched by the need to manage terabytes or petabytes of structured and unstructured data in a way that delivers genuine value to business users.
When data is siloed and there is no solution put into place, businesses can’t access the real-time insights they need to make the best decisions for their business. Performance goes down, headaches abound and cost goes way up, all in the effort to manage the data. That’s why a Big Data integration solution is a prerequisite for getting the best performance and the most real-time insights, at the lowest cost.
Q3. What are the capabilities of Hadoop beyond data storage?
John Leach, Monte Zweben: Hadoop has a very broad range of capabilities and tools:
– Oozie for workflow
– Pig for scripting
– Mahout or SparkML for machine learning
– Kafka and Storm for streaming
– Flume and Sqoop for integration
– Hive, Impala, Spark, and Drill for SQL analytic querying
– HBase for NoSQL
– Splice Machine for operational, transactional RDBMS
Q4. What programming skills are required to handle application development around Big Data platforms like Hadoop?
John Leach, Monte Zweben: To handle application development on Hadoop, individuals have choices to go raw Hadoop or SQL-on-Hadoop. When going the SQL route, very little new skills are required and developers can open connections to an RDBMS on Hadoop just like they used to do on Oracle, DB2, SQLServer, or Teradata. Raw HAdoop application developers should know their way around the core components of the Hadoop stack–such as HDFS, MapReduce, Kafaka, Storm, Oozie, Hive, Pig, HBase, and YARN. They should also be proficient in Java.
Q5. What are the current challenges for real-time application deployment on Hadoop?
John Leach, Monte Zweben: When we talk about real-time at Splice Machine, we’re focused on applications that require not only real-time responses to queries, but also real-time database updates from a variety of data sources. The former is not all that uncommon on Hadoop; the latter is nearly impossible for most Hadoop-based systems.
Deploying real-time applications on Hadoop is really a function of moving Hadoop beyond its batch processing roots to be able to handle real-time database updates with high concurrency and transactional integrity. We harness HBase along with a lockless snapshot isolation design to provide full ACID transactions across rows and tables.
This technology enables Splice Machine to execute the high concurrency of transactions required by real-time applications.
Q6. What is special about Splice Machine auto-sharding replication and failover technology?
John Leach, Monte Zweben: As part of its automatic auto-sharding, HBase horizontally partitions or splits each table into smaller chunks or shards that are distributed across multiple servers. Using the inherent failover and replication capabilities of HBase and Hadoop, Splice Machine can support applications that demand high availability.
HBase co-processors are used to embed Splice Machine in each distributed HBase region (i.e., data shard). This enables Splice Machine to achieve massive parallelization by pushing the computation down to each distributed data shard without any overhead of MapReduce.
Q7. How difficult is it for customers to migrate from legacy databases to Splice Machine?
John Leach, Monte Zweben: Splice Machine offers a variety of services to help businesses efficiently deploy the Splice Machine database and derive maximum value from their investment. These services include both implementation consulting and educational offerings delivered by our expert team.
Splice Machine has designed a Safe Journey program to significantly ease the effort and risk for companies migrating to a Splice Machine database. The Safe Journey program includes a proven methodology that helps choose the right workloads to migrate, implements risk-mitigation best practices, and includes commercial tools that automate most of the PL/SQL conversion process.
This is not to suggest that all legacy databases will convert to a Hadoop RDBMS.
The best candidates will typically have over 1TB of data, which often leads to cost and scaling issues in legacy databases.
Q8. You have recently announced partnership with Talend, mrc (michaels, ross & cole ltd.) and RedPoint Global. Why Talend, mrc, and RedPoint Global? What is the strategic meaning of these partnerships for Splice Machine?
John Leach, Monte Zweben: Our uptick in recent partnerships demonstrates the tremendous progress our team has made over the past year. We have been working relentlessly to develop the Splice Machine Hadoop RDBMS into a fully enterprise-ready database that can replace legacy database systems.
The demand for programming talent to handle application development is growing faster than the supply of skilled talent, especially around newer platforms like Hadoop. We partnered with mrc to give businesses a solution that can speed real-time application deployment on Hadoop with the staff and tools they currently have, while also offering future-proof applications over a database that scales to meet increasing data demands.
We partnered with Talend to bring our customers the benefit of two different approaches for managing data integration affordable and at scale. Talend’s rich capabilities including drag and drop user interface, and adaptable platform allow for increased productivity and streamlined testing for faster deployment of web, mobile, OLTP or Internet of Things applications.
And finally, we integrated and certified our Hadoop RDBMS on RedPoint’s Convergent Marketing Platform™ to create a new breed of solution for marketers. With cost-efficient database scale-out and real-time cross-channel execution, the solution enables enterprises to future-proof their marketing technology investment through affordable access to all their data (social, mobile, click streams, website behaviors, etc.) across a proliferating and ever-changing list of channels. Furthermore, it complements any existing Hadoop deployment, including those on the Cloudera, MapR and Hortonworks distributions.
Q9. How is Splice Machine working with Hadoop distribution partners –such as MapR, Hortonworks and Cloudera?
John Leach, Monte Zweben: Since Splice Machine does not modify HBase, it can be used with any standard Hadoop distribution that includes HBase, including Cloudera, MapR and Hortonworks. Splice Machine enables enterprises using these three companies to tap into real-time updates with transactional integrity, an important feature for companies looking to become real-time, data-driven businesses.
In 2013, Splice Machine partnered with MapR to enable companies to use the MapR distribution for Hadoop to build their real time, SQL-on-Hadoop applications. In 2014, we joined the Cloudera Connect Partner Program, after certifying on CDH 5. We are working closely with Cloudera to maximize the potential of its full suite of Hadoop-powered software and our unique approach to real-time Hadoop.
That same year, we joined Hortonworks Technology Partner program. This enabled our users to harness innovations in management, provisioning and security for HDP deployments. For HDP users, Splice Machine enables them to build applications that use ANSI-standard SQL and support real-time updates with transactional integrity, allowing Hadoop to be used in both OLTP and OLAP applications.
Earlier this year, we were excited to achieve Hortonworks® Data Platform (HDP™) Certification. With the HDP certification, our customers can leverage the pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform, the industry’s only 100-percent open source Hadoop distribution, to simplify and accelerate their Splice Machine and Hadoop deployments.
Q10 What are the challenges of running online transaction processing on Hadoop?
John Leach, Monte Zweben: With its heritage as a batch processing system, Hadoop does not provide the transaction support required by online transaction processing. Transaction support can be tricky enough to implement for shared-disk RDBMSs such as Oracle, but it becomes far more difficult to implement in distributed environments such as Hadoop. A distributed transactional model requires high-levels of coordination across a cluster with too much overhead, while simultaneously providing high performance for a high concurrency of small read and writes, high-speed ingest, and massive bulk loads. We prove this by being able to run the TPC-C benchmark at scale.
Splice Machine met those requirements by using distributed snap isolation, a Multi-Version Concurrency Control model that delivers lockless, and high-concurrency transactional support. Splice Machine extended research from Google’s Percolator project, Yahoo Lab’s OMID project, and the University of Waterloo’s HBaseSI project to develop its own patent-pending, distributed transactions.
John Leach – CTO & Cofounder Splice Machine
With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies.
Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning.
John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach currently is the organizer for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.
Monte Zweben – CEO & Cofounder Splice Machine
A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program.
Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.
In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA.
Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.
Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.