
"Trends and Information on Big Data, Data Science, New Data Management Technologies, and Innovation."


Sep 4 15

On Fraud Analytics and Fraud Detection. Interview with Bart Baesens

by Roberto V. Zicari

“Many companies don’t use analytical fraud detection techniques yet. In fact, most still rely on an expert based approach, meaning that they build upon the experience, intuition and business knowledge of the fraud analyst.” –Bart Baesens

On the topics of fraud analytics and fraud detection, I have interviewed Bart Baesens, professor at KU Leuven (Belgium) and lecturer at the University of Southampton (United Kingdom).


Q1. What exactly is Fraud Analytics?

Good question! First of all, in our book we define fraud as an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types of forms. The idea of using analytics for fraud detection is catalyzed by the enormous amount of data which is currently being generated in any business process. Think about insurance claim handling, credit card transactions, cash transfers, tax payments, etc. to name a few. In our book, we discuss various ways of analyzing these massive data sets in a descriptive, predictive or social network way to come up with new analytical fraud detection models.

Q2. What are the main challenges in Fraud Analytics? 

The definition we gave above highlights the 5 key challenges in fraud analytics. The first one concerns the fact that fraud is uncommon. Independent of the exact setting or application, only a minority of the involved population of cases typically concerns fraud, of which furthermore only a limited number will be known to concern fraud. This seriously complicates the estimation of analytical models.

Fraudsters try to blend into the environment and not behave differently from others, in order not to get noticed and to remain hidden among non-fraudsters. This effectively makes fraud imperceptibly concealed, since fraudsters do succeed in hiding by carefully considering and planning precisely how to commit fraud.

Fraud detection systems improve and learn by example. Therefore the techniques and tricks fraudsters adopt evolve in time along with, or better ahead of fraud detection mechanisms. This cat and mouse play between fraudsters and fraud fighters may seem to be an endless game, yet there is no alternative solution so far. By adopting and developing advanced analytical fraud detection and prevention mechanisms, organizations do manage to reduce losses due to fraud since fraudsters, like other criminals, tend to look for the easy way and will look for other, easier opportunities.

Fraud is typically a carefully organized crime, meaning that fraudsters often do not operate independently, have allies, and may induce copycats. Moreover, several fraud types such as money laundering and carousel fraud involve complex structures that are set up in order to commit fraud in an organized manner. This means fraud is not an isolated event, so detecting it requires taking the context (e.g., the social network of fraudsters) into account. This is also extensively discussed in our book.

A final element in the description of fraud provided in our book indicates the many different types of forms in which fraud occurs. This refers both to the wide set of techniques and approaches used by fraudsters and to the many different settings in which fraud occurs or economic activities that are susceptible to fraud.

Q3. What is the current state of the art in ensuring early detection in order to mitigate fraud damage?

Many companies don’t use analytical fraud detection techniques yet. In fact, most still rely on an expert-based approach, meaning that they build upon the experience, intuition and business knowledge of the fraud analyst. Such an expert-based approach typically involves a manual investigation of a suspicious case, which may have been signaled for instance by a customer complaining of being charged for transactions they did not make. Such a disputed transaction may indicate that a new fraud mechanism has been discovered or developed by fraudsters, and therefore requires a detailed investigation for the organization to understand and subsequently address the new mechanism.

Comprehension of the fraud mechanism or pattern allows extending the fraud detection and prevention mechanism which is often implemented as a rule base or engine, meaning in the form of a set of IF-THEN rules, by adding rules that describe the newly detected fraud mechanism. These rules, together with rules describing previously detected fraud patterns, are applied to future cases or transactions and trigger an alert or signal when fraud is or may be committed by use of this mechanism. A simple, yet possibly very effective example of a fraud detection rule in an insurance claim fraud setting goes as follows:


IF:
  • Amount of claim is above threshold OR
  • Severe accident, but no police report OR
  • Severe injury, but no doctor report OR
  • Claimant has multiple versions of the accident OR
  • Multiple receipts submitted

THEN:
  • Flag claim as suspicious AND
  • Alert fraud investigation officer
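The rule above can be sketched in code as follows. This is only an illustration of an IF-THEN rule, not an implementation taken from the book: the `Claim` fields, the `AMOUNT_THRESHOLD` value and the function names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold; a real rule base would tune this per product line.
AMOUNT_THRESHOLD = 10_000.0


@dataclass
class Claim:
    # All field names are illustrative assumptions.
    amount: float
    severe_accident: bool
    police_report: bool
    severe_injury: bool
    doctor_report: bool
    accident_versions: int
    receipts_submitted: int


def triggered_conditions(claim):
    """Return the names of the IF-conditions that hold for this claim."""
    conditions = {
        "amount above threshold": claim.amount > AMOUNT_THRESHOLD,
        "severe accident without police report":
            claim.severe_accident and not claim.police_report,
        "severe injury without doctor report":
            claim.severe_injury and not claim.doctor_report,
        "multiple versions of the accident": claim.accident_versions > 1,
        "multiple receipts submitted": claim.receipts_submitted > 1,
    }
    return [name for name, hit in conditions.items() if hit]


def evaluate(claim):
    """Apply the rule: any condition (OR) leads to both actions (AND)."""
    reasons = triggered_conditions(claim)
    fired = bool(reasons)
    return {"suspicious": fired, "alert_investigator": fired, "reasons": reasons}
```

Returning the list of triggered conditions alongside the flag gives the investigator a reason for each alert, which matters because every signaled case needs human follow-up.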

Such an expert approach suffers from a number of disadvantages. Rule bases or engines are typically expensive to build, since they require advanced manual input by the fraud experts, and often turn out to be difficult to maintain and manage. Rules have to be kept up to date and should only or mostly trigger on real fraudulent cases, since every signaled case requires human follow-up and investigation. Therefore the main challenge concerns keeping the rule base lean and effective, in other words deciding when and which rules to add, remove, update, or merge.

By using data-driven analytical models such as descriptive, predictive or social network analytics in a complementary way, we can improve the performance of our fraud detection approaches in terms of precision, cost efficiency and operational effectiveness.

Q4. Is early detection all that can be done? Are there any other advanced techniques that can be used?

You can do more than just detection. Two components are essential to almost any effective strategy to fight fraud: fraud detection and fraud prevention. Fraud detection refers to the ability to recognize or discover fraudulent activities, whereas fraud prevention refers to measures taken to avoid or reduce fraud. The difference between the two is clear-cut: the former is an ex post approach, whereas the latter is an ex ante approach. Both tools may and likely should be used in a complementary manner to pursue the shared objective of fraud reduction. However, as also discussed in our book, preventive actions will change fraud strategies and consequently impact detection power. Installing a detection system will cause fraudsters to adapt and change their behavior, and so the detection system itself will eventually impair its own detection power. So although complementary, fraud detection and prevention are not independent and therefore should be aligned and considered as a whole.

Q5. How do you examine fraud patterns in historical data? 

You can examine it in two possible ways: descriptive or predictive. Descriptive analytics or unsupervised learning aims at finding unusual anomalous behavior deviating from the average behavior or norm. This norm can be defined in various ways. It can be defined as the behavior of the average customer at a snapshot in time, or as the average behavior of a given customer across a particular time period, or as a combination of both. Predictive analytics or supervised learning assumes the availability of a historical data set with known fraudulent transactions. The analytical models built can thus only detect fraud patterns as they occurred in the past. Consequently, it will be impossible to detect previously unknown fraud. Predictive analytics can however also be useful to help explain the anomalies found by descriptive analytics.
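As a rough illustration of the descriptive approach, the sketch below scores a new transaction amount against two norms: the customer's own history and a snapshot of peer behavior. The z-score, the threshold of 3 and the choice to flag when either norm is violated are assumptions made for this example, not a method prescribed in the interview.

```python
from statistics import mean, stdev


def z_score(value, reference):
    """Standardized distance of value from the reference sample (needs >= 2 points)."""
    mu = mean(reference)
    sigma = stdev(reference)
    if sigma == 0:
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma


def is_anomalous(amount, own_history, peer_amounts, threshold=3.0):
    """Flag the amount if it deviates strongly from either norm.

    own_history:  past amounts of the same customer (norm over time)
    peer_amounts: amounts of comparable customers (norm at a snapshot)
    """
    deviates_from_self = abs(z_score(amount, own_history)) > threshold
    deviates_from_peers = abs(z_score(amount, peer_amounts)) > threshold
    return deviates_from_self or deviates_from_peers
```

No labels are needed here, which is why this style of analysis can surface fraud patterns that have not been seen before.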

Q6. How do you typically utilize labeled, unlabeled, and networked data for fraud detection?

Labeled observations or transactions can be analyzed using predictive analytics. Popular techniques here are linear/logistic regression, neural networks and ensemble methods such as random forests. These techniques can be used to predict both fraud incidence, which is a classification problem, as well as fraud intensity, which is a classical regression problem. Unlabeled data can be investigated using descriptive analytics. As said, the aim here is to detect anomalies deviating from the norm. Popular techniques here are: break point analysis, peer group analysis, association rules and clustering. Networked data can be analyzed using social network techniques. We found those to be very useful in our research. Popular techniques here are community detection and featurization. In our research, we developed GOTCHA!, a supervised social network learner for fraud detection. This is also extensively discussed in our book.

Q6. Fraud techniques change over time. How do you handle this? 

Good point! A key challenge concerns the dynamic nature of fraud. Fraudsters constantly try to beat detection and prevention systems by developing new strategies and methods. Therefore adaptive analytical models and detection and prevention systems are required, in order to detect and resolve fraud as soon as possible. Detecting fraud as early as possible is crucial. Hence, we also discuss how to continuously backtest analytical fraud detection models. The key idea here is to verify whether the fraud model still performs satisfactorily. Changing fraud tactics create concept drift, implying that the relationship between the target fraud indicator and the data available changes on an ongoing basis. Hence, it is important to closely follow up on the performance of the analytical model so that concept drift and any related performance deviation can be detected in a timely way. Depending upon the type of model and its purpose (e.g. descriptive or predictive), various backtesting activities can be undertaken. Examples are backtesting data stability, model stability and model calibration.
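One common way to backtest data stability is a population stability index (PSI), which compares how a variable or score was distributed over bins when the model was built with how it is distributed now. The interview does not name a specific measure, so the choice of PSI, the function name and the `eps` guard below are assumptions for illustration.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between a reference (development) and a recent binned distribution.

    Both arguments are counts per bin, using the same bins in the same order.
    eps guards against log(0) when a bin is empty.
    """
    exp_total = sum(expected_counts)
    act_total = sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        p_e = max(e / exp_total, eps)
        p_a = max(a / act_total, eps)
        psi += (p_a - p_e) * math.log(p_a / p_e)
    return psi
```

A PSI of zero means the two distributions are identical. Practitioners often treat values above roughly 0.10 as worth watching and above roughly 0.25 as a significant shift, but these cut-offs are rules of thumb and should be validated for each setting.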

Q7. What are the synergies between Fraud Analytics and CyberSecurity?

Fraud analytics creates both opportunities and threats for cybersecurity. Think about intrusion detection as an example. Predictive methods can be adopted to study known intrusion patterns, whereas descriptive methods or anomaly detection can identify emerging cyber threats. The emergence of the Internet of Things (IoT) will certainly increase the importance of fraud analytics for cybersecurity. Some examples of new fraud threats are:

  • Fraudsters might force access to web configurable devices (e.g. Automated Teller Machines (ATMs)) and set up fraudulent transactions;
  • Device hacking whereby fraudsters change operational parameters of connected devices (e.g. smart meters are manipulated to make them under-register actual usage);
  • Denial of Service (DoS) attacks whereby fraudsters massively attack a connected device to stop it from functioning;
  • Data breach whereby a user’s login information is obtained in a malicious way, resulting in identity theft;
  • Gadget fraud also referred to as gadget lust whereby fraudsters file fraudulent claims to either obtain a new gadget or free upgrade;
  • Cyber espionage whereby exchanged data is eavesdropped by an intelligence agency or used by a company for commercial purposes.

More than ever before, fraud will be dynamic and continuously changing in an IoT context. From an analytical perspective, this implies that predictive techniques will continuously lag behind since they are based on a historical data set with known fraud patterns. Hence, as soon as the predictive model has been estimated, it will become outdated even before it has been put into production. Descriptive methods such as anomaly detection, peer group and break point analysis will gain in importance. These methods should be capable of analyzing evolving data streams and perform incremental learning to deal with concept drift. To facilitate (near) real-time fraud detection, the data and algorithms should be processed in-memory instead of relying on slow secondary storage. Furthermore, based upon the results of these analytical models, it should be possible to take fully automated actions such as the shutdown of a smart meter or ATM.
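To make the idea of incremental learning on a data stream concrete, here is a minimal sketch of a detector that keeps an exponentially weighted mean and variance in memory and updates them with every reading, so that older behavior is gradually forgotten. The class name, the parameters `alpha`, `threshold` and `warmup`, and the use of a z-score are assumptions for illustration, not a technique specified in the interview.

```python
class StreamingAnomalyDetector:
    """Flags readings far from an exponentially weighted running norm."""

    def __init__(self, alpha=0.1, threshold=3.0, warmup=5):
        self.alpha = alpha          # weight of the newest reading (forgetting rate)
        self.threshold = threshold  # z-score above which a reading is flagged
        self.warmup = warmup        # readings to observe before flagging anything
        self.mean = 0.0
        self.var = 0.0
        self.count = 0

    def update(self, x):
        """Score x against the current norm, then fold x into the norm."""
        is_anomaly = False
        if self.count >= self.warmup and self.var > 0:
            z = abs(x - self.mean) / self.var ** 0.5
            is_anomaly = z > self.threshold

        if self.count == 0:
            self.mean = x
        else:
            # Incremental exponentially weighted mean and variance update.
            delta = x - self.mean
            self.mean += self.alpha * delta
            self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        self.count += 1
        return is_anomaly
```

Each update uses constant time and memory, which suits (near) real-time processing. Note that this sketch also folds flagged readings into the norm; a production system might exclude or down-weight them.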

Qx. Anything else you wish to add?

We are happy to refer to our book for more information. We also value your opinion and look forward to receiving any feedback (both positive and negative)!


Professor Bart Baesens is a professor at KU Leuven (Belgium), and a lecturer at the University of Southampton (United Kingdom). He has done extensive research on big data & analytics, customer relationship management, web analytics, fraud detection, and credit risk management. His findings have been published in well-known international journals and presented at international top conferences. He is also author of the books Analytics in a Big Data World and Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques. He also teaches the E-learning course Advanced Analytics in a Big Data World, and regularly tutors, advises and provides consulting support to international firms with respect to their analytics and credit risk management strategy.


Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection (Wiley and SAS Business Series). Authors: Bart Baesens, Veronique Van Vlasselaer, Wouter Verbeke.
Series: Wiley and SAS Business Series, Hardcover: 400 pages. Publisher: Wiley; 1 edition, September 2015. ISBN-10: 1119133122

Fraud Analytics: Using Supervised, Unsupervised and Social Network Learning Techniques. Authors: Bart Baesens, Véronique Van Vlasselaer, Wouter Verbeke
Publisher: Wiley 256 pages
September 2015
ISBN-13: 978-1119133124 | ISBN-10: 1119133122

– Critical Success Factors for Analytical Models: Some Recent Research Insights. Bart Baesens, 27 APR, 2015

– Analytics in a Big Data World: The Essential Guide to Data Science and its Applications. Bart Baesens, 30 APR, 2014

Related Posts

The threat from AI is real, but everyone has it wrong, Robert Munro, CEO Idibon.

Follow on Twitter: @odbmsorg


Aug 19 15

On Hadoop and Big Data. Interview with Lawrence Schwartz

by Roberto V. Zicari

“The best way to define Big Data ROI is to look at how our customers define it and benefit from Hadoop.
Wellcare has been able to improve its query speeds from 30 days to just 7 days. This acceleration enabled the Company to increase its analytics and operational reporting by 73%.”–Lawrence Schwartz

I have interviewed Lawrence Schwartz, Chief Marketing Officer, Attunity.


Q1. What are the common challenges that enterprises face when trying to use Hadoop?

Lawrence Schwartz: The advent of Hadoop and Big Data has significantly changed the way organizations handle data. There’s a need now for new skills, new organizational processes, new strategies and technologies to adapt to the new playing field. It’s a change that permeates everywhere from how you touch the data, to how much you can support resource-wise and architecturally, to how you manage it and use it to stay competitive. Hadoop itself presents two primary challenges. First, the data has to come from somewhere. Enterprises must efficiently load high volumes of widely-varied data in a timely fashion. We can help with software that enables automated bulk loading into Hadoop without manual coding, and change data capture for efficient updates. The second challenge is finding engineers and Data Scientists with the right skills to exploit Hadoop. Talent is scarce in this area.

Q2. Could you give us some examples of how your customers use Hadoop for their businesses?

Lawrence Schwartz: We have an interesting range of customers using Hadoop, so I’ll provide three examples. One major cable provider we are working with uses Hadoop as a data lake. They are integrating feeds from 200 data stores into Pivotal HD. This data lake includes fresh enterprise data – fed in real-time, not just as an archival area – to run up-to-date reporting and analytics without hitting key transactional systems. This enables them to improve decision support and gain competitive advantage.

Another example of how our customers are using Hadoop highlights a Fortune 50 high technology manufacturer. This customer’s business analytics requirements were growing exponentially, straining IT resources, systems and budgets. 
The company selected Attunity Visibility to help it better understand its enterprise-wide data usage analytics across its various data platforms.
Having this capability enables the company to optimize business performance and maximize its investment in its Hadoop, data warehouse and business analytics systems. Attunity Visibility has helped to improve the customer’s system throughput by 25% enabling them to onboard new analytic applications without increasing investment in data warehouse infrastructure.

The third example is a financial services institution. This customer has many different data sources, including Hadoop, and one of its key initiatives is to streamline and optimize fraud detection. Using a historical analysis component, the organization monitors real-time activity against historical trends to detect any suspicious activity. For example, if you go to a grocery store outside of your normal home ZIP code one day and pay for your goods with a credit card, this could trigger an alert at your bank. The bank would then see that you historically did not use your credit card at that retailer, prompting them to put a hold on your card and potentially preventing a thief from using your card unlawfully. Using Attunity to leverage both historical and real-time transactions in its analytics, this company is able to decrease fraud and improve customer satisfaction.
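The grocery-store example can be reduced to a very small check of a live transaction against the card's history. The sketch below is only an illustration of that idea; the `LocationHistory` class and its method are hypothetical and do not represent Attunity's software or the bank's actual logic.

```python
from collections import defaultdict


class LocationHistory:
    """Tracks which ZIP codes each card has been used in."""

    def __init__(self):
        self._seen = defaultdict(set)

    def check_and_record(self, card_id, zip_code):
        """Return True if this ZIP code is new for a card that already has history.

        The transaction is recorded either way. A card with no history yet
        is never flagged, since there is no norm to compare against.
        """
        known = self._seen[card_id]
        is_new_location = bool(known) and zip_code not in known
        known.add(zip_code)
        return is_new_location
```

A real system would combine many such signals (retailer, amount, time of day) and would likely wait for confirmation before adding a suspicious location to the trusted history.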

Q3. How difficult is it to gain deep insight into data usage patterns?

Lawrence Schwartz: Historically, enterprises just haven’t had the tools to efficiently understand how datasets and data warehouse infrastructure are being used. We provide Visibility software that uniquely enables organizations to understand how tables and other Data Warehouse components are being used by business lines, departments, organizations etc. It continuously collects, stores, and analyzes all queries and applications against data warehouses. They are then correlated with data usage and workload performance metrics in a centralized repository that provides detailed usage and performance metrics for the entire data warehouse. With this insight, organizations can place the right data on the right platform at the right time. This can reduce the cost and complexity of managing multiple platforms.

Q4. Do you believe that moving data across platforms is a feasible alternative for Big Data? 

Lawrence Schwartz: It really has to be, because nearly every enterprise has more than one platform, even before Hadoop is considered in the mix. Having multiple types of platforms also yields the benefits and challenges of trying to tier data based on its value, between data warehouses, Hadoop, and cloud offerings. Our customers rely on Attunity to help them with this challenge every day. Moving heterogeneous data in many different formats, and from many different sources is challenging when you don’t have the right tools or resources at your disposal. The problem gets magnified when you’re under the gun to meet real-time SLAs. In order to be able to do all of that well, you need to have a way to understand what data to move, and how to move the data easily, seamlessly and in a timely manner. Our solutions make the whole process of data management and movement automated and seamless, and that’s our hallmark.

Q5. What is “Application Release Automation” and why is it important for enterprises?

Lawrence Schwartz: Application release automation (ARA) solutions are a proven way to support Agile development, accelerate release cycles, and standardize deployment processes across all tiers of the application and content lifecycles. ARA solutions can be used to support a wide variety of activities, ranging from publishing and modifying web site content to deploying web-based tools, distributing software to business end users, and moving code between Development, Test, and Production environments.

Attunity addresses this market with an automation platform for enterprise server, web operations, shared hosting, and data center operations teams. Attunity ARA solutions are designed to offload critical, time-consuming deployment processes in complex enterprise IT environments. Enterprises that adopt ARA solutions enjoy greater business flexibility, improved productivity, better cross-team collaboration, and improved consistency.

Q6. What are your relationships with other Hadoop vendors?

Lawrence Schwartz: Attunity has great working partnerships with all of the major Hadoop platform vendors, including Cloudera, Hortonworks, Pivotal and MapR. We have terrific synergy and work together towards a common goal – to help our customers meet the demands of a growing data infrastructure, optimize their Big Data environments, and make onboarding to Hadoop as easy as possible. Our solutions are certified with each of these vendors, so customers feel confident knowing that they can rely on us to deliver a complete and seamless joint solution for Hadoop.

Q7. Attunity recently acquired Appfluent Technology, Inc. and BIReady. Why Appfluent Technology? Why BIReady? How do these acquisitions fit into Attunity's overall strategy?

Lawrence Schwartz: When we talk with enterprises today, we hear about how they are struggling to manage mountains of growing data and looking for ways to make complex processes easier. We develop software and acquire companies that help our customers streamline and optimize existing systems as well as scale to meet the growing demands of business.

Appfluent brings the Visibility software I described earlier. With Visibility, companies can rebalance data to improve performance and cost in high-scale, rapidly growing environments. They also can meet charge-back, show-back and audit requirements.

BIReady, now known as Attunity Compose, helps enterprises build and update data warehouses more easily. Data warehouse creation and administration is among the most labor-intensive and time-consuming aspects of analytics preparation. Attunity Compose overcomes the complexity with automation, using significantly fewer resources. It automatically designs, generates and populates enterprise data warehouses and data marts, adding data modeling and structuring capabilities inside the data warehouse.

Q8. How do you define Big Data ROI?

Lawrence Schwartz: The best way to define this is to look at how our customers define it and benefit from Hadoop.

One of our Fortune 500 customers is Wellcare, which provides managed care services to government-sponsored healthcare programs like Medicaid and Medicare. Wellcare plans to use our software to load data from its Pivotal data warehouse into Hadoop, where they will do much of their data processing and transformations. They will then move a subset of that data from Hadoop back into Pivotal and run their analytics from there. So in this case Hadoop is a staging area. As a result of implementing the first half of this solution (moving data from various databases into Pivotal), Wellcare has been able to improve its query speeds from 30 days to just 7 days. This acceleration enabled the Company to increase its analytics and operational reporting by 73%. At the same time, the solution helps Wellcare meet regulatory requirements in a timely manner more easily, ensuring that it receives the state and federal funding required to run efficiently and productively.

In another example, one of our customers, a leading online travel services company, was dealing with exploding data volumes, escalating costs and an insatiable appetite for business analytics. They selected Attunity Visibility to reduce costs and improve information agility by offloading data and workload from their legacy data warehouse systems to a Hadoop Big Data platform. Attunity Visibility has saved the company over $6 million in two years by ensuring that the right workload and data are stored and processed on the most cost-effective platform based on usage.


CUSTOMER SPOTLIGHT WEBINAR SERIES: Healthcare Success Story – How WellCare Accelerated Big Data Delivery to Improve Analytics

Related Posts

Streamlining the Big Data Landscape: Real World Network Security Usecase By Sonali Parthasarathy Accenture Technology Labs.

Thirst for Advanced Analytics Driving Increased Need for Collective Intelligence By John K. Thompson – General Manager, Advanced Analytics, Dell Software, August 2015.

Evolving Analytics by Carlos Andre Reis Pinheiro, Data Scientist, Teradata.

Business Requirements First, Technology Second by Tamara Dull, Director of Emerging Technologies, SAS Best Practices.

A Cheat Sheet: What Executives Want to Know about Big Data by Tamara Dull, Director of Emerging Technologies for SAS Best Practices.


Jul 31 15

In-Memory Database Technology for Telecom. Interview with Manjul Maharishi

by Roberto V. Zicari

“Our evaluation of IMDSs determined that eXtremeDB-64 IMDS outperformed other IMDSs in terms of performance and scalability.”–Manjul Maharishi.

I have interviewed Manjul Maharishi, Vice President (telecom software development) at Transaction Network Services.
They use In-Memory Database technology to manage some of the largest real-time community networks in the world.


Q1. What is the mission of Transaction Network Services (TNS)?

Manjul Maharishi: Transaction Network Services manages many of the largest real-time community networks in the world, enabling industry participants to simply and securely interact and transact with other businesses, to access the data and applications they need, over managed and secure communications platforms. TNS’ existing footprint supports millions of connections and access to critical databases, enabling its customers through a single connection, a “one-to-many and many-to-many” global platform, securely blending private and public networking.

Q2. What is TNS’s Carrier ENUM Registry? And what is it useful for?

Manjul Maharishi: Carrier ENUM Registry is a product offering for telecom carriers that provides information critical to the accurate routing and billing of inter-carrier communications, such as voice and mobile data services.
Carrier ENUM Registry addresses a challenge that is posed every time you place a phone call or send a text message: how, in the split second of latency that is deemed acceptable, will the call or message find the way to its recipient?

As a solution, Carrier ENUM Registry makes available an up-to-date, portability-corrected image of the entire public dial plan as well as authoritative information sourced directly from the service provider that “owns” (in telecom parlance, has the “right-to-use”) a particular telephone number. This is provided in the form of two registries, or databases:

Number Identity Registry is a massive repository of global telephone numbers and carrier-of-record information that identifies which service provider a telephone number was allocated to for end-user assignment. In response to lookups the registry returns a Carrier Identifier (which can be in the form of a Service Provider Identifier (SPID), and/or a Mobile Country Code+Mobile Network Code(MCC+MNC)) and when available, the Location Routing Number (LRN) of ported and pooled numbers.

Network Routing Directory is a multi-party shared registration system that furnishes service providers with sophisticated data-sharing capabilities featuring safeguard controls designed to uphold data-sharing policies. Using our secure portal service, providers self-administer data and selectively grant access, in whole or part, to trading partners (and vice versa).

Q3. Who are the customers using Carrier ENUM Registry?

Manjul Maharishi: The customers are telecom carriers – large, small and in-between – worldwide.
They are mobile, landline and IP-based, including some ISPs (Internet Service Providers) and cable MSOs (multiple system operators) that offer phone service, as well as “pure” VoIP providers. Most query the Carrier ENUM Registry deployment hosted at TNS’ facility in the US, but some host the application and database on their own premises.

Q4. What services do you support?

Manjul Maharishi: Carrier ENUM emerged as a service to connect the public switched telephone network (PSTN) and new IP-based networks, by resolving phone numbers to IP addresses and services. It also provided a bridge between IP-based carriers. For example, with a multi-vendor database of routes, users and phone numbers available, a caller on IP-based Network A could communicate with a user of IP-based Network B without routing calls across the PSTN (which would incur costs and may require avoidable transcoding).
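For readers unfamiliar with ENUM, the sketch below shows the number-to-domain mapping defined for public ENUM in RFC 6116: the digits of an E.164 number are reversed, separated by dots, and placed under an apex domain, which is then queried in DNS for NAPTR records. The interview does not describe how TNS performs lookups internally; the function name is hypothetical, and carrier ENUM deployments commonly use a private apex rather than `e164.arpa`.

```python
def enum_domain(e164_number, apex="e164.arpa"):
    """Map an E.164 number to its ENUM domain name (RFC 6116 style).

    Non-digit characters such as '+', spaces and dashes are ignored.
    The apex is a parameter because private (carrier) ENUM trees
    typically do not live under the public e164.arpa domain.
    """
    digits = [ch for ch in e164_number if ch in "0123456789"]
    if not digits:
        raise ValueError("number contains no digits")
    return ".".join(reversed(digits)) + "." + apex
```

For example, `enum_domain("+44 20 7946 0148")` yields `8.4.1.0.6.4.9.7.0.2.4.4.e164.arpa`, the domain that would be queried for the services registered for that number.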

Over time, though, Carrier ENUM Registry has gained complexity along with new features, and does much more than bridging between carriers. Supported services now include number portability, IP-peering between telephone service providers, SMS/MMS (aka “text messages”) routing, unbundling of services (allowing messaging to be offered separately from voice, for example), customized views of data, routing based on time/destination/origination, and more. These services have added complexity to TNS’ Carrier ENUM Registry business logic and have caused its databases to grow larger and the routing logic to become more complex.

Carriers can pick and choose from the various Carrier ENUM Registry features, to solve their particular challenges.
One of the biggest use cases in demand now is identifying the right carrier to terminate an SMS or MMS when number portability is involved in the host country.

Q5. What kind of real-time performance demands does Carrier ENUM Registry need to satisfy?

Manjul Maharishi: To customers, we commit to providing a response from our system within 10 milliseconds for 95% of the queries. Please note that a single customer query can result in dozens to a couple of hundred individual table queries based on the routing logic and services subscribed. However, largely through the use of in-memory database system (IMDS) technology for data management, we have been able to have a much lower variance in the query responses and a higher degree of predictability. Our typical average response to a customer query is less than 2 msec. These numbers only reflect the latency introduced by our platform, i.e. the time difference between when we receive the query and when we respond back.
The network latency (the time between when the query leaves the customer network and when they receive the response) is larger (typical US cross-country network latency is 60-100 msec). An industry norm for the maximum acceptable time from when a subscriber dials digits to when they hear a ringing tone back is ~150-200 msecs, beyond which the “dead air/silence” becomes noticeable for the subscriber. However, for international calls, people do tend to be more tolerant of such post-dial delays.
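A commitment such as "within 10 milliseconds for 95% of the queries" can be checked against measured platform latencies with a percentile calculation. The sketch below uses the nearest-rank definition of a percentile; the function names and the way latencies are collected are assumptions, not a description of TNS's monitoring.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no measurements")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_commitment(latencies_ms, limit_ms=10.0, pct=95):
    """True if at least pct% of the queries were answered within limit_ms."""
    return percentile(latencies_ms, pct) <= limit_ms
```

Tracking a high percentile rather than the average is what makes low variance matter: a handful of slow queries can break the commitment even when the mean stays under 2 msec.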

Q6. Can you give an overview of the system architecture and toolset used to handle the increasingly complex business logic and growing data volume?

Manjul Maharishi: In order to handle the growing amount of stored data, we use general-purpose off-the-shelf Linux servers. This allows us to take advantage of industry-wide gains in processing power, memory and performance, as well as eliminate any dependence upon specific vendors for a software/hardware upgrade cycle. Currently, the systems are running on dual CPU, 6- and 8-core processors.

For data management we use eXtremeDB-64, the 64-bit edition of McObject’s eXtremeDB In-Memory Database System (IMDS). The system is architected such that each server stores the entire database, and customer queries are load balanced across a set of such servers. Accordingly, the platform is easily scaled by adding new servers as needed. Apart from providing this service as a cloud-based offering (“Central Replica”), we also offer it as a customer-premises deployment model (“Local Replica”), whereby customers gain a much lower round-trip time (RTT) by avoiding network latency. The TNS network operations center (NOC) monitors key performance indicators of our Central and Local Replica servers on a 24×7 basis, and we have agreements in place with our customers to scale up the platform by adding more servers if needed.

With the performance provided by eXtremeDB-64, we haven’t had a need to partition the data set in order to meet our commitments. We do use the database system’s Patricia trie indices to reduce the number of lookups required on certain tables, and work through the business logic to narrow down the search results to a manageable number early in the business rules processing.

In terms of development tools, we are developing in C++ using eXtremeDB’s C/C++ API instead of accessing via an SQL API, and this contributes to lower application latency. We develop software using Agile methodology with Continuous Integration that has nightly builds with a suite of automated tests executing during these builds. We also incorporate code coverage, leak detection and profiling as part of this Continuous Integration.

Q7. Can you tell more about how Carrier ENUM Registry meets its real-time data access requirements? Did it move to in-memory database technology recently or has this always been a feature?

Manjul Maharishi: The system architecture keeps data needed for real-time queries in memory, where it can be accessed quickly. Early versions of Carrier ENUM Registry accomplished searches using in-memory database code developed in-house for the application. However, TNS recognized several years ago that with the increasingly complex queries and higher data volumes, Carrier ENUM Registry would be better-served by an off-the-shelf in-memory database system (IMDS) that provides flexibility while scaling to hundreds of millions and even billions of records. After researching IMDSs, we chose the 64-bit eXtremeDB-64 and the new Carrier ENUM Registry version incorporating eXtremeDB-64 launched in 2013.

Currently, the system holds a master or archival data set in Oracle Enterprise DBMSs, with the data used for real-time lookups hosted “downstream” in eXtremeDB-64. Each downstream server hosts the entire data set used by the application; this data set consists of three separate (i.e. with unique schemas and data) databases with a combined size of 120 GB.
Two of the databases managed by eXtremeDB-64 on each server are “pure” in-memory databases while the third utilizes McObject’s eXtremeDB Fusion technology to include some persistent (on-disk) data storage.

Q8. Why did you choose eXtremeDB-64 from the field of available IMDSs and what has your experience been using it?

Manjul Maharishi: Our evaluation of IMDSs determined that eXtremeDB-64 IMDS outperformed other IMDSs in terms of performance and scalability. Among other test findings, TNS determined that eXtremeDB’s performance exceeded 2 million queries per second with a 10 million-row database. When TNS upped the challenge by increasing the test database size 30-fold (to 300 million records), eXtremeDB’s responsiveness fell only minimally, validating the near-linear scalability results documented in McObject’s published benchmarks. TNS’ platform for these tests consisted of Intel Xeon X5570 2.93 GHz hardware, with 8 cores and hyper-threading enabled, running Red Hat Enterprise Linux 4, with 72 GB RAM.
Using eXtremeDB-64 in production with Carrier ENUM Registry has borne out our expectations: the database system meets current needs while providing room for future growth in both database size and complexity of application features.

Q9. You mentioned the use of the Patricia trie index in your database. Can you elaborate on the advantage it provides?

Manjul Maharishi: Support for the Patricia trie database index is another key eXtremeDB-64 feature (along with in-memory data storage) that enables Carrier ENUM Registry to meet its performance goals. The name of this specialized index derives from “Practical Algorithm To Retrieve Information Coded In Alphanumeric” and “reTRIEval”. Unlike the widely used B-tree index – which can also be used for finding keys with a specified prefix but can require multiple iterations to find the longest prefix match if there are multiple prefix matches in the index – the Patricia trie excels in searching for the longest prefixes of a specified value.

This approach meshes well with the unique nature of Carrier ENUM Registry’s data, and its queries. Phone numbers serve as keys for the searches that are performed when a call is placed. The key is stored on individual numbers, blocks or ranges.
A block consists of phone numbers in a quantity ranging from 1,000 to 10,000. For example, it could be the number 703667 and four additional digits ranging from 0000 to 9999. A range is a subset of a block. The key would be applied to an individual number, for example, when that number was ported from another carrier and does not fall into a large block of numbers serviced by the company using TNS’ application.

In most of the cases, it is not known beforehand if the number being queried has been “ported out” of a block or not, so the application (in the absence of the Patricia trie) would have to make multiple queries – starting with the most specific match and then dropping the least-significant digit one-by-one till a match was found. With the Patricia trie, there is only one iteration within the original query, which is much less taxing in performance terms and greatly simplifies the application logic.
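The difference between the two lookup strategies described above can be sketched in a few lines. This is only an illustration: it uses a plain, uncompressed digit trie rather than a true Patricia trie, it is not eXtremeDB’s API, and the carrier names and the sample block 703667 with a single ported number are hypothetical. It assumes non-empty prefixes.

```python
class DigitTrie:
    """Plain (uncompressed) trie keyed on phone-number digits."""

    def __init__(self):
        self.root = {}

    def insert(self, prefix, carrier):
        node = self.root
        for digit in prefix:
            node = node.setdefault(digit, {})
        # The "carrier" key cannot collide with single-digit child keys.
        node["carrier"] = carrier

    def longest_prefix(self, number):
        """Return (matched_prefix, carrier) for the longest stored prefix, or None."""
        node = self.root
        best = None
        for i, digit in enumerate(number):
            node = node.get(digit)
            if node is None:
                break
            if "carrier" in node:
                best = (number[: i + 1], node["carrier"])
        return best


def longest_prefix_by_truncation(table, number):
    """Baseline without a trie: one exact lookup per truncation, longest first."""
    lookups = 0
    for end in range(len(number), 0, -1):
        lookups += 1
        if number[:end] in table:
            return number[:end], table[number[:end]], lookups
    return None, None, lookups


# Hypothetical data: one block of numbers plus one number ported out of it.
entries = {"703667": "Carrier A", "7036671234": "Carrier B"}
trie = DigitTrie()
for prefix, carrier in entries.items():
    trie.insert(prefix, carrier)

print(trie.longest_prefix("7036671234"))  # ('7036671234', 'Carrier B')
print(trie.longest_prefix("7036675555"))  # ('703667', 'Carrier A')
print(longest_prefix_by_truncation(entries, "7036675555"))  # ('703667', 'Carrier A', 5)
```

The trie resolves the query in a single walk over the digits, while the truncation baseline needs five separate exact-match lookups for the same unported number.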

Q10. Are there other aspects of your approach to data management that you’d like to mention?

Manjul Maharishi: The hybrid storage capability of eXtremeDB Fusion, mentioned above, gives us useful flexibility. eXtremeDB Fusion enables the developer to specify in-memory or persistent storage for record types within a database. Storing data on a hard disk drive (HDD) or solid state drive (SSD) has two benefits: it reduces memory demands, thereby helping us stay within servers’ maximum memory capacities (this was our primary reason for using eXtremeDB Fusion), and byte for byte, persistent storage is less expensive than memory.

We first used eXtremeDB’s hybrid storage to manage a large set of meta-data for mobile handsets such as device types, model names, dates of activation, etc. This information is used by a non-call-processing application and is looked up less frequently than Carrier ENUM Registry’s real-time routing data, so we were okay with the higher latency and variance in response that is introduced by disk-based access.

We are now expanding our use of hybrid storage to add some additional information (such as mobile device information and capabilities) to stored phone numbers, in order to enhance the communication between two subscribers – for example, by enabling features such as HD-voice, Rich Communications Services (RCS), etc. These features can result in substantially increasing the database size and memory footprint required, and eXtremeDB Fusion allows us to easily configure which portions of the data set are kept in memory and which ones are kept on persistent storage with a configurable subset cached in memory – thus allowing us to store some of the less heavily used dataset in SSDs or regular HDDs, while still maintaining the high performance required for the bulk of the transactions.

In his role as Vice President of Telecom Software Development at TNS, Manjul Maharishi is responsible for overseeing architecture, design, development and testing for all of the Products and Services offered by TNS’ Telecom Services Division. These include several massively sized Telecom Databases (serving Number Portability, Toll Free, Call Routing and Calling Name services), 3G/4G Roaming Hubs, associated Clearing and Settlement services and Data Analytics.

Prior to joining TNS, Manjul held senior technical management positions at VeriSign and Lucent Technologies working in similar areas, including building the industry’s first widely deployed Softswitch while at Lucent Technologies.



– eXtremeDB Case Study: Industry Trend Toward Algorithmic Trading

– eXtremeDB Embedded Database Version 6.0

Related Posts

– Gartner Market Guide for In-Memory DBMS

– Looking beyond the DBMS: Towards Holistic Performance Optimization for Enterprise Architectures

– Gaining An Extreme Performance Advantage

– Database Persistence, Without The Performance Penalty

Follow on Twitter: @odbmsorg


Jul 23 15

Big Data and the Networking industry. Interview with Oskar Mencer

by Roberto V. Zicari

“Cybersecurity is growing in importance with Obama, Xi and Cameron having announced major efforts to gain better control over the Internet. Dataflow computing enables computation as a bump-in-the-wire without disturbing the flow of packets. Building gateways out of DFEs will significantly support the Cybersecurity agenda for years to come.”–Oskar Mencer

I have interviewed Oskar Mencer, CEO and Founder at Maxeler Technologies. The main topic of the interview is Big Data and the Networking industry, and what Maxeler Technologies is contributing in this market.


Q1. What are data flow computers and Dataflow Engines (DFEs)? What are they typically useful for?

Oskar Mencer: Dataflow computers are highly efficient systems for computational problems with large amounts of structured and unstructured data and significant mission-critical computation. DFEs are units within dataflow computers which currently hold up to 96GB of memory and provide on the order of 10K parallel operations. To put that into perspective, for some tasks, a DFE has the equivalent compute capability of a farm of several normal computers, but at a fraction of the price and energy consumption.

Q2. What is dataflow analytics, and why is it important?

Oskar Mencer: Dataflow analytics is a software stack on top of Dataflow computers, providing powerful compute constructs on large datasets. Dataflow analytics is a potential answer to the challenges of Big Data and the Internet of Things.

Q3. What is a programmable data plane and how can one create secure storage with it?

Oskar Mencer: Software Defined Networking is all about the programmable control plane. Maxeler’s programmable data plane is the next step in the transformation of the Networking industry.

Q4. What are the main challenges for financial institutions who need to analyze and process massive quantities of information instantly from various sources in order to make better trading decisions?

Oskar Mencer: Today’s financial institutions have a major challenge from new legislation and requirements imposed by governments. Technology can solve some of the issues, but not all of them. On the trading side, whoever manages to process more data and derive more predictive capability from it, has a better position in the marketplace. Trading is becoming more complex and more regulated, and Maxeler’s Technology, in particular as it applies to exchanges, is starting to make a significant difference in the field, helping to push the state-of-the-art while simultaneously making finance safer.

Q5. Juniper Networks announced QFX5100-AA, a new application acceleration switch, and QFX-PFA, a new packet flow accelerator module. How do they plan to use Maxeler Technologies’ dataflow computing?

Oskar Mencer: The Application Acceleration module is based on Maxeler DFEs and programmable with Maxeler dataflow programming tools and infrastructure. The variety of networking applications this enables is tremendous, as is evident from our App gallery, which includes Apps for the Juniper switch.

Q6. What are the advantages of using a Spark/Hadoop appliance using a Juniper switch with programmable data plane?

Oskar Mencer: With a Juniper switch with a programmable dataplane, one could cache answers, compute in the network, optimize and merge maps, and generally make Spark/Hadoop deployment more scalable and more efficient.

Q7. Do you see a convergence of computer, networking and storage via Dataflow Engines (DFEs)?

Oskar Mencer: Indeed, DFEs provide efficiency at the core of networking, storage as well as compute. Dataflow computing has the potential to unify computation, the movement of data and the storage of data into a single system to solve the largest Big data analytics challenges that lie ahead.

Q8. Maxeler has been mentioned in a list of major HPC applications that had an impact on Quality of Life and Prosperity. Could you please explain what is special about this HPC application?

Oskar Mencer: Maxeler solutions provide competitive advantage and help in situations with mission-critical challenges. In 2011, just after the height of the credit crisis, Maxeler won the American Finance Technology Award with JP Morgan for applying dataflow computing to credit derivatives risk computations. Dataflow computing is a good solution for challenges where computing matters.

Q9. Big Data for the Common Good. What is your take on this?

Oskar Mencer: Big Data is a means to an end. Common good arises from bringing more predictability and stability into our lives. For example, many marriages have been saved by the availability of Satnav technology in cars, clearly a Big Data challenge. Medicine is an obvious Big Data challenge. Curing a patient is as much a Big Data challenge as fighting crime, and government in general. I see Maxeler’s dataflow computing technology as a key opportunity to address the Big Data challenges of today and tomorrow.

Qx. Anything else you wish to add?

Oskar Mencer: Cybersecurity is growing in importance with Obama, Xi and Cameron having announced major efforts to gain better control over the Internet. Dataflow computing enables computation as a bump-in-the-wire without disturbing the flow of packets. Building gateways out of DFEs will significantly support the Cybersecurity agenda for years to come.

Oskar Mencer is CEO and Founder at Maxeler Technologies.
Prior to founding Maxeler, Oskar was Member of Technical Staff at the Computing Sciences Center at Bell Labs in Murray Hill, leading the effort in “Stream Computing”. He joined Bell Labs after receiving a PhD from Stanford University. Besides driving Maximum Performance Computing (MPC) at Maxeler, Oskar was Consulting Professor in Geophysics at Stanford University and he is also affiliated with the Computing Department at Imperial College London, having received two Best Paper Awards, an Imperial College Research Excellence Award in 2007 and a Special Award from Com.sult in 2012 for “revolutionising the world of computers”.



– Programming MPC Systems. White Paper, Maxeler Technologies

Related Posts

Streamlining the Big Data Landscape: Real World Network Security Usecase. By Sonali Parthasarathy Accenture Technology Labs.

WHY DATA SCIENCE NEEDS STORY TELLING. BY Steve Lohr, technology reporter for the New York

Pre-emptive Financial Markets Regulation – next step for Big Data. By Morgan Deane, Helvea-Baader Bank Group.

Data, Process and Scenario Analytics: An Emerging Regulatory Line of Offence. BY Dr. Ramendra K Sahoo, KPMG Financial Risk Management.



Jul 13 15

On Hadoop and Big Data. Interview with John Leach

by Roberto V. Zicari

“One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration.”– John Leach

I have interviewed John Leach, CTO & Cofounder of Splice Machine. The main topics of the interview are Hadoop, Big Data integration and what Splice Machine has to offer in this space. Monte Zweben, CEO of Splice Machine, also contributed to the interview.


Q1. What are the Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation?

John Leach, Monte Zweben:
1. Individual record lookups. Most SQL-on-Hadoop engines are designed for full table scans in analytics, but tend to be too slow for the individual record lookups and range scans used by operational applications.
2. Dirty Data. Dirty data is a problem for any system, but it is compounded in Big Data, often resulting in bad reports and delays to reload an entire data set.
3. Sharding. It can be difficult to know which key to distribute data on and what the right shard size is. Getting this wrong results in slow queries, especially for large joins or aggregations.
4. Hotspotting. This happens when data becomes too concentrated in a few nodes, especially for time series data. The impact is slow queries and poor parallelization.
5. SQL coverage. Limited SQL dialects will make it so you can’t run queries to meet business needs. You’ll want to make sure you do your homework. Compile the list of toughest queries and test.
6. Concurrency. Low concurrency can result in the inability to power real-time apps, handle many users, support many input sources, and deliver reports as updates happen.
7. Columnar. Not all columnar solutions are created equal. Besides columnar storage, there are many other optimizations, such as vectorization and run-length encoding, that can have a big impact on analytic performance. If your OLAP queries run slowly, which is common with large joins and aggregations, productivity suffers: queries may take minutes or hours instead of seconds. The flip side is using columnar storage when you need concurrency and real-time performance.
8. Node Sizing. Do your homework and profile your workload. Choosing the wrong node size (e.g., CPU cores, memory) can negatively impact price/performance and create performance bottlenecks.
9. Brittle ETL on Hadoop. With many SQL-on-Hadoop solutions being unable to provide update or delete capabilities without a full data reload, this can cause a very brittle ETL that will require restarting your ETL pipeline because of errors or data quality issues. The result is a missed ETL window and delayed reports to business users.
10. Cost-Based Optimizer. A cost-based optimizer improves performance by selecting the right join strategy, the right index, and the right ordering. Some SQL-on-Hadoop engines have no cost-based optimizer or relatively immature ones that can result in poor performance and poor productivity, as well as manual tuning by DBAs.
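The hotspotting pitfall (item 4) is easy to reproduce in miniature. The sketch below assumes a range-partitioned store with four shards and shows that keys made of increasing timestamps all land on one shard, while prefixing each key with a hash-derived bucket number ("salting", a generic mitigation that is not described in the interview) spreads them out. The shard boundaries, helper names and event data are all hypothetical.

```python
import hashlib
from collections import Counter

# Hypothetical range boundaries: shard i holds keys below boundaries[i],
# and the last shard holds everything else.
TIME_BOUNDARIES = ["2015-01", "2015-04", "2015-07"]
SALT_BOUNDARIES = ["1|", "2|", "3|"]
NUM_BUCKETS = 4


def shard_for(key, boundaries):
    """Return the index of the range shard that owns this key."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)


def salted(key, buckets=NUM_BUCKETS):
    """Prefix the key with a stable bucket number derived from its hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return f"{digest[0] % buckets}|{key}"


# 1,000 hypothetical events arriving in time order within twenty minutes.
events = [f"2015-07-13T10:{m:02d}:{s:02d}" for m in range(20) for s in range(50)]

unsalted_counts = Counter(shard_for(k, TIME_BOUNDARIES) for k in events)
salted_counts = Counter(shard_for(salted(k), SALT_BOUNDARIES) for k in events)

print(unsalted_counts)                # every event on the last shard
print(sorted(salted_counts.items()))  # events spread over all four shards
```

The trade-off is that a time-range read over salted keys has to scan every bucket and merge the results, so salting helps write-heavy workloads more than range-scan-heavy ones.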

Q2. In your experience, what are the most common problems in Big Data integration?

John Leach, Monte Zweben: Providing users access to data in a fashion they can understand and at the moment they need it, while ensuring quality and security, can be incredibly challenging.

The volume and velocity of data that businesses are churning out, along with the variety of different sources, can pose many issues.

One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration. Businesses may also find their skill sets, workload, and budgets over-stretched by the need to manage terabytes or petabytes of structured and unstructured data in a way that delivers genuine value to business users.

When data is siloed and there is no solution put into place, businesses can’t access the real-time insights they need to make the best decisions for their business. Performance goes down, headaches abound and cost goes way up, all in the effort to manage the data. That’s why a Big Data integration solution is a prerequisite for getting the best performance and the most real-time insights, at the lowest cost.

Q3. What are the capabilities of Hadoop beyond data storage?

John Leach, Monte Zweben: Hadoop has a very broad range of capabilities and tools:

Oozie for workflow
Pig for scripting
Mahout or SparkML for machine learning
Kafka and Storm for streaming
Flume and Sqoop for integration
Hive, Impala, Spark, and Drill for SQL analytic querying
HBase for NoSQL
Splice Machine for operational, transactional RDBMS

Q4. What programming skills are required to handle application development around Big Data platforms like Hadoop?

John Leach, Monte Zweben: To handle application development on Hadoop, developers can choose between raw Hadoop and SQL-on-Hadoop. When going the SQL route, very few new skills are required, and developers can open connections to an RDBMS on Hadoop just as they used to do on Oracle, DB2, SQL Server, or Teradata. Raw Hadoop application developers should know their way around the core components of the Hadoop stack, such as HDFS, MapReduce, Kafka, Storm, Oozie, Hive, Pig, HBase, and YARN. They should also be proficient in Java.

Q5. What are the current challenges for real-time application deployment on Hadoop?

John Leach, Monte Zweben: When we talk about real-time at Splice Machine, we’re focused on applications that require not only real-time responses to queries, but also real-time database updates from a variety of data sources. The former is not all that uncommon on Hadoop; the latter is nearly impossible for most Hadoop-based systems.

Deploying real-time applications on Hadoop is really a function of moving Hadoop beyond its batch processing roots to be able to handle real-time database updates with high concurrency and transactional integrity. We harness HBase along with a lockless snapshot isolation design to provide full ACID transactions across rows and tables.

This technology enables Splice Machine to execute the high concurrency of transactions required by real-time applications.

Q6. What is special about Splice Machine auto-sharding replication and failover technology?

John Leach, Monte Zweben: As part of its auto-sharding, HBase horizontally partitions or splits each table into smaller chunks or shards that are distributed across multiple servers. Using the inherent failover and replication capabilities of HBase and Hadoop, Splice Machine can support applications that demand high availability.

HBase co-processors are used to embed Splice Machine in each distributed HBase region (i.e., data shard). This enables Splice Machine to achieve massive parallelization by pushing the computation down to each distributed data shard without any overhead of MapReduce.

Q7. How difficult is it for customers to migrate from legacy databases to Splice Machine?

John Leach, Monte Zweben: Splice Machine offers a variety of services to help businesses efficiently deploy the Splice Machine database and derive maximum value from their investment. These services include both implementation consulting and educational offerings delivered by our expert team.

Splice Machine has designed a Safe Journey program to significantly ease the effort and risk for companies migrating to a Splice Machine database. The Safe Journey program includes a proven methodology that helps choose the right workloads to migrate, implements risk-mitigation best practices, and includes commercial tools that automate most of the PL/SQL conversion process.

This is not to suggest that all legacy databases will convert to a Hadoop RDBMS.
The best candidates will typically have over 1TB of data, which often leads to cost and scaling issues in legacy databases.

Q8. You have recently announced partnership with Talend, mrc (michaels, ross & cole ltd.) and RedPoint Global. Why Talend, mrc, and RedPoint Global? What is the strategic meaning of these partnerships for Splice Machine?

John Leach, Monte Zweben: Our uptick in recent partnerships demonstrates the tremendous progress our team has made over the past year. We have been working relentlessly to develop the Splice Machine Hadoop RDBMS into a fully enterprise-ready database that can replace legacy database systems.

The demand for programming talent to handle application development is growing faster than the supply of skilled talent, especially around newer platforms like Hadoop. We partnered with mrc to give businesses a solution that can speed real-time application deployment on Hadoop with the staff and tools they currently have, while also offering future-proof applications over a database that scales to meet increasing data demands.

We partnered with Talend to bring our customers the benefit of two different approaches for managing data integration affordably and at scale. Talend’s rich capabilities, including a drag-and-drop user interface and an adaptable platform, allow for increased productivity and streamlined testing for faster deployment of web, mobile, OLTP or Internet of Things applications.

And finally, we integrated and certified our Hadoop RDBMS on RedPoint’s Convergent Marketing Platform™ to create a new breed of solution for marketers. With cost-efficient database scale-out and real-time cross-channel execution, the solution enables enterprises to future-proof their marketing technology investment through affordable access to all their data (social, mobile, click streams, website behaviors, etc.) across a proliferating and ever-changing list of channels. Furthermore, it complements any existing Hadoop deployment, including those on the Cloudera, MapR and Hortonworks distributions.

Q9. How is Splice Machine working with Hadoop distribution partners –such as MapR, Hortonworks and Cloudera?

John Leach, Monte Zweben: Since Splice Machine does not modify HBase, it can be used with any standard Hadoop distribution that includes HBase, including Cloudera, MapR and Hortonworks. Splice Machine enables enterprises using these three distributions to tap into real-time updates with transactional integrity, an important feature for companies looking to become real-time, data-driven businesses.

In 2013, Splice Machine partnered with MapR to enable companies to use the MapR distribution for Hadoop to build their real time, SQL-on-Hadoop applications. In 2014, we joined the Cloudera Connect Partner Program, after certifying on CDH 5. We are working closely with Cloudera to maximize the potential of its full suite of Hadoop-powered software and our unique approach to real-time Hadoop.

That same year, we joined the Hortonworks Technology Partner program. This enabled our users to harness innovations in management, provisioning and security for HDP deployments. For HDP users, Splice Machine makes it possible to build applications that use ANSI-standard SQL and support real-time updates with transactional integrity, allowing Hadoop to be used in both OLTP and OLAP applications.

Earlier this year, we were excited to achieve Hortonworks® Data Platform (HDP™) Certification. With the HDP certification, our customers can leverage the pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform, the industry’s only 100-percent open source Hadoop distribution, to simplify and accelerate their Splice Machine and Hadoop deployments.

Q10. What are the challenges of running online transaction processing on Hadoop?

John Leach, Monte Zweben: With its heritage as a batch processing system, Hadoop does not provide the transaction support required by online transaction processing. Transaction support can be tricky enough to implement for shared-disk RDBMSs such as Oracle, but it becomes far more difficult to implement in distributed environments such as Hadoop. A distributed transactional model requires high levels of coordination across a cluster without too much overhead, while simultaneously providing high performance for a high concurrency of small reads and writes, high-speed ingest, and massive bulk loads. We prove this by being able to run the TPC-C benchmark at scale.

Splice Machine met those requirements by using distributed snapshot isolation, a Multi-Version Concurrency Control model that delivers lockless, high-concurrency transactional support. Splice Machine extended research from Google’s Percolator project, Yahoo Labs’ OMID project, and the University of Waterloo’s HBaseSI project to develop its own patent-pending distributed transactions.
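To make the idea of snapshot isolation over multiple versions more concrete, here is a toy, single-process sketch. It is not Splice Machine’s patent-pending design, nor Percolator, OMID or HBaseSI; it only illustrates the general technique under simple assumptions: every transaction reads from the snapshot taken at its start timestamp, readers take no locks, and a write-write conflict is resolved by letting the first committer win. All class and method names are hypothetical.

```python
import itertools


class Conflict(Exception):
    """Raised when a write-write conflict forces a transaction to abort."""


class Store:
    def __init__(self):
        self._clock = itertools.count(1)  # shared logical timestamp source
        self.versions = {}  # key -> list of (commit_ts, value), oldest first

    def begin(self):
        return Transaction(self, next(self._clock))

    def _read(self, key, snapshot_ts):
        # Newest version committed at or before the reader's snapshot.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def _commit(self, txn):
        # First committer wins: abort if any written key has a version
        # committed after this transaction took its snapshot.
        for key in txn.writes:
            history = self.versions.get(key, [])
            if history and history[-1][0] > txn.start_ts:
                raise Conflict(key)
        commit_ts = next(self._clock)
        for key, value in txn.writes.items():
            self.versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # buffered until commit

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        return self.store._read(key, self.start_ts)

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store._commit(self)


store = Store()
setup = store.begin()
setup.put("balance", 100)
setup.commit()

reader = store.begin()
writer = store.begin()
writer.put("balance", 80)
writer.commit()

print(reader.get("balance"))         # 100, the reader still sees its snapshot
print(store.begin().get("balance"))  # 80, a later transaction sees the update
```

In a distributed system the hard parts are the ones this sketch skips: issuing timestamps consistently across a cluster and making the conflict check and the commit atomic across servers.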


John Leach – CTO & Cofounder Splice Machine
With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies.
Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning.
John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach currently is the organizer for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.

Monte Zweben – CEO & Cofounder Splice Machine
A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program.
Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.

In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA.
Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.

Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.



– Splice Machine resource page

Related Posts

– Common misconceptions about SQL on Hadoop. By Cynthia M. Saracco, July 2015

– SQL over Hadoop: Performance isn’t everything… By Simon Harris, March 2015

– Archiving Everything with Hadoop. By Mark Cusack, December 2014.

– On Hadoop RDBMS. Interview with Monte Zweben, ODBMS Industry Watch, November 2, 2014

– AsterixDB: Better than Hadoop? Interview with Mike Carey, ODBMS Industry Watch, October 22, 2014





Jun 24 15

On Apache Flink. Interview with Volker Markl.

by Roberto V. Zicari

“I would consider stream data analysis to be a major unique selling proposition for Flink. Due to its pipelined architecture Flink is a perfect match for big data stream processing in the Apache stack.”–Volker Markl

I have interviewed Volker Markl, Professor and Chair of the Database Systems and Information Management group at the Technische Universität Berlin. Main topic of the interview is the Apache Top-Level Project, Flink.


Q1. Was it difficult for the Stratosphere Research Project (i.e., a project originating in Germany) to evolve and become an Apache Top-Level Project under the name Flink?

Volker Markl: I do not have a frame of reference. However, I would not consider the challenges for a research project originating in Germany to be any different from any other research project underway anywhere else in the world.
Back in 2008, when I conceived the idea for Stratosphere and attracted co-principal investigators from TU Berlin, HU Berlin, and the Hasso Plattner Institute Potsdam, we jointly worked on a vision and had already placed a strong emphasis on systems building and open-source development early on. It took our team about three years to deliver the first open-source version of Stratosphere and then it took us several more years to gain traction and increase our visibility.
We had to make strides to raise awareness and make the Stratosphere Research Project more widely known in the academic, commercial, research, and open-source communities, particularly, on a global scale. Unfortunately, despite our having started in 2008, we had not foreseen there being a naming problem. The name Stratosphere was trademarked by a commercial entity and as such we had to rename our open-source system. Upon applying for Apache incubation, we put the renaming issue to a vote and finally agreed upon the name Flink, a name that I am very happy with.
Flink is a German word that means ‘agile or swift.’ It suits us very well since this is what the original project was about. Overall, I would say, our initiating this project in Germany (or in Europe for that matter) did not impose any major difficulties.

Q2. What are the main data analytics challenges that Flink is attempting to address?

Volker Markl: Our key vision for both Stratosphere and now Flink was “to reduce the complexity that other distributed data analysis engines exhibit, by integrating concepts from database systems, such as declarative languages, query optimization, and efficient parallel in-memory and out-of-core algorithms, with the Map/Reduce framework, which allows for schema on read, efficient processing of user-code, and massive scale-out.” In addition, we introduced two novel features.
One focused on the ‘processing of iterative algorithms’ and the other on ‘streaming.’ For the former, we recognized that fixed-point iterations were crucial for data analytics.
Hence, we incorporated varying iterative algorithm processing optimizations.
For example, we use delta-iterations to avoid unnecessary work, reduce communication, and run analytics faster.
Moreover, this concept of iterative computations is tightly woven into the Flink query optimizer, thereby relieving the data scientist of (i) having to worry about caching decisions, (ii) moving invariant code out of the loop, and (iii) thinking about and building indexes for data used between iterations. For the latter, since Flink is based on a pipelined execution engine akin to parallel database systems, this formed a good basis for us to integrate streaming operations with rich windowing semantics seamlessly into the framework. This allows Flink to process streaming operations in a pipelined way with lower latency than (micro-)batch architectures and without the complexity of lambda architectures.
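To make the idea of delta iterations more concrete, here is a minimal sketch in plain Python. It is not Flink code and does not use any Flink API; the function name `connected_components_delta` and the sample graph are assumptions made up for illustration. It computes connected components as a fixed-point iteration, but each round only revisits the vertices whose label changed in the previous round (the "workset"), which is the kind of unnecessary work a delta iteration avoids.

```python
from collections import defaultdict


def connected_components_delta(edges):
    """Label every vertex with the smallest vertex id in its component.

    Only vertices whose label changed in the previous round are
    revisited, instead of recomputing the whole graph each round.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Solution set: the current label of every vertex (initially its own id).
    labels = {v: v for v in neighbors}
    # Workset: vertices whose label changed and must be propagated.
    workset = set(labels)
    rounds = 0

    while workset:
        rounds += 1
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                # Propagate the smaller label; remember what changed.
                if labels[v] < labels[n]:
                    labels[n] = labels[v]
                    next_workset.add(n)
        workset = next_workset

    return labels, rounds


if __name__ == "__main__":
    labels, rounds = connected_components_delta([(1, 2), (2, 3), (4, 5)])
    print(labels, "converged after", rounds, "rounds")
```

The loop stops as soon as the workset is empty, that is, when a fixed point is reached. In a bulk iteration every vertex would be re-examined in every round, even those that converged long ago.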

Q3. Why is Flink an alternative to Hadoop’s MapReduce solution?

Volker Markl: Flink is a scalable data analytics framework that is fully compatible with the Hadoop ecosystem.
In fact, most users employ Flink in Hadoop clusters. Flink can run on Yarn and it can read from & write to HDFS. Flink is compatible with all Hadoop input & output formats and (as of recently and in a beta release) even has a Map/Reduce compatibility mode. Additionally, it supports a mode to execute Flink programs on Apache Tez. Flink can handle far more complex analyses than Map/Reduce programs. Its programming model offers higher order functions, such as joins, unions, and iterations. This makes coding analytics simpler than in Map/Reduce. For large pipelines consisting of many Map/Reduce stages Flink has an optimizer, similar to what Hive or Pig offer for Map/Reduce for relational operations. However, in contrast, Flink optimizes extended Map/Reduce programs and not scripting language programs built on top.
In this manner, Flink reduces an impedance mismatch for programmers. Furthermore, Flink has been shown to substantially outperform Map/Reduce for many operations out of the box, and since Flink is a stream processor at its core, it can also process continuous streams.

Q4. Could you share some details about Flink’s current performance and how you reduce latency? 

Volker Markl: Flink is a pipelined engine. A great deal of effort has been placed in enabling efficient memory management.
The system gracefully switches between in-memory and out-of-core algorithms.
The Flink query optimizer intelligently leverages partitioning and other interesting data properties in more complex analysis flows, thereby reducing communication, processing overhead, and thus latency. In addition, the delta iteration feature reduces the overhead during iterative computations, speeds up analytics, and shortens execution time. There are several performance studies on the web that show that Flink has very good performance or outperforms other systems.

Q5. What about Flink’s reliability and ease of use?

Volker Markl: We have had very good feedback regarding both usability and reliability. It is extremely easy to get started with Flink if you are familiar with Java, Scala, or Python. Flink APIs are very clean. For example, the table, graph, and dataset APIs are easy to use for anyone who has been writing data analytics programs in Java and Scala or in systems, such as MATLAB, Python, or R.
Flink supports a local mode for debugging, and a lot of effort has gone into making it require little configuration, so that developers can move a job to production with little effort.
Flink has had native memory management and operations on serialized data from very early on. This reduces configuration and enables very robust job execution.
The system has been tested on clusters with hundreds of nodes. Projects that develop notebook functionality for rapid prototyping, namely Apache Zeppelin, are integrating with Flink to further reduce overhead and get an analysis pipeline up and running.
Like other open-source projects, Flink is constantly improving its reliability and ease-of-use with each release. Most recently, a community member created an interactive shell, which will make it easier for first-time users to conduct data analysis with Flink. The Berlin Big Data Center is currently prototyping machine learning and text mining libraries for Flink based on the Apache Mahout DSL.
SICS (The Swedish Institute for Computer Science) in Stockholm is currently working on a solution to ease installation, whereas Data Artisans is providing tooling to further improve the ease of use.

Q6. How well does Flink perform for real time (as opposed to batch)  big data analytics?

Volker Markl: I would consider stream data analysis to be a major unique selling proposition for Flink. Due to its pipelined architecture Flink is a perfect match for big data stream processing in the Apache stack. It provides native data streams with window operations and an API for streaming that matches the API for the analysis of data at rest.
The community has added a novel way to checkpoint streams with low overhead and is now working on surfacing persistent state functionality.
Data does not have to be moved across system boundaries (e.g., as in a lambda architecture) when combining both streams and datasets. Programmers do not have to learn different programming paradigms when crafting an analysis. Administrators do not have to manage the complexity of multiple engines as in a lambda architecture (for instance, managing version compatibility). And of course the performance shows a clear benefit due to deep integration.
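As a rough illustration of what a window operation over a native data stream does, the following plain-Python sketch consumes events one at a time and emits per-key counts whenever a fixed-size (tumbling) window closes. It is not Flink's streaming API; the function name `tumbling_window_counts`, the event shape `(timestamp, key)`, and the assumption that events arrive ordered by timestamp are all chosen here for illustration.

```python
def tumbling_window_counts(events, window_size):
    """Yield (window_start, key, count) each time a window closes.

    events: iterable of (timestamp, key) pairs, assumed to arrive
    in timestamp order. Records are handled one by one as they
    arrive, rather than being collected into batches first.
    """
    current_start = None
    counts = {}
    for ts, key in events:
        start = ts - ts % window_size
        if current_start is not None and start != current_start:
            # The previous window is complete: emit its results.
            for k in sorted(counts):
                yield (current_start, k, counts[k])
            counts = {}
        current_start = start
        counts[key] = counts.get(key, 0) + 1
    # Flush the last, still-open window at end of input.
    for k in sorted(counts):
        yield (current_start, k, counts[k])


if __name__ == "__main__":
    stream = [(0, "a"), (1, "b"), (4, "a"), (5, "a"), (12, "b")]
    for result in tumbling_window_counts(stream, window_size=5):
        print(result)
```

Because the function accepts any iterable, the same code runs unchanged on a finite list (data at rest) or on an unbounded generator (data in motion), which loosely mirrors the point about a single API for both.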

Q7. What are the new Flink features that the community is currently working on?

Volker Markl: There are plenty of new features. A major ongoing effort is graduating Flink’s streaming API and capabilities from beta status. A recent blog post details this work. Another effort is continuing to expand Flink’s libraries, namely, FlinkML for Machine Learning & Gelly for graph processing by adding more algorithms.
Flink’s Table API is a first step towards SQL support, which is planned for both batch and streaming jobs. The ecosystem around Flink is also growing with systems, such as Apache Zeppelin, Apache Ignite, and Google Cloud Dataflow integrating with Flink.

Q8. What role does Data Artisans (a Berlin-based startup) play in the Flink project?

Volker Markl: The startup data Artisans was created by a team of core Flink committers & initiators of the Flink project. They are committed to growing the Apache Flink community and code base.

Q9. Is Flink an alternative to Spark and Storm?

Volker Markl: I would consider Flink to be an alternative to Spark for batch processing, if you need graceful degradation for out-of-core operations or processing iterative algorithms that can be incrementalized. Also, Flink is an alternative to Spark, if you need real data streaming with a latency that the Spark microbatch processing cannot provide. Flink is an alternative to any lambda architecture, involving Storm with either Hadoop or Spark, as it can process richer operations and can easily process data at rest and data in motion jointly in a single processing framework.

Q10. What are the major differences between Flink, Spark, and Storm?

Volker Markl: Overall, the core distinguishing feature of Flink over the other systems is an efficient native streaming engine that supports both batch processing and delta iterations. In particular, it enables efficient machine learning and graph analysis through query optimization across APIs as well as its highly optimized memory management, which supports graceful degradation from in-memory to out-of-core algorithms for very large distributed datasets.
Flink is an alternative to those projects, although many people are using several engines on the same Hadoop cluster built on top of YARN, depending on the specific workload and taste.
At its core, Flink is a streaming engine, surfacing batch and streaming APIs. In contrast, at its core, Spark is an in-memory batch engine that executes streaming jobs as a series of mini-batches. Compared to Storm, Flink streaming has a checkpointing mechanism with lower overhead, as well as an easy-to-use API. Certainly, Flink supports batch processing quite well. In fact, a streaming dataflow engine is a great match for batch processing, which is the approach that parallel databases (e.g., Impala) have been following.

Q11. Is Flink already used in production?

Volker Markl: Indeed, two companies already use Flink in production for both batch and stream processing, and a larger number of companies are currently trying out the system. For that reason, I am looking forward to the first annual Flink conference, called Flink Forward, which will take place on Oct 12-13, 2015 in Berlin, where I am certain we will hear more about its use in production.

Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin (TU Berlin). Volker also holds a position as an adjunct full professor at the University of Toronto and is director of the research group “Intelligent Analysis of Mass Data” at DFKI, the German Research Center for Artificial Intelligence.
Earlier in his career, Dr. Markl led a research group at FORWISS, the Bavarian Research Center for Knowledge-based Systems in Munich, Germany, and was a Research Staff member & Project Leader at the IBM Almaden Research Center in San Jose, California, USA. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 18 patents, has transferred technology into several commercial products, and advises several companies and startups.
He has been speaker and principal investigator of the Stratosphere research project that resulted in the “Apache Flink” big data analytics system and is currently leading the Berlin Big Data Center. Dr. Markl currently also serves as the secretary of the VLDB Endowment and was recently elected as one of Germany’s leading “digital minds” (Digitale Köpfe) by the German Informatics Society (GI).

A detailed Bio can be found at


MONDAY JAN 12, 2015, The Apache Software Foundation Announces Apache™ Flink™ as a Top-Level Project

Apache Flink Frequently Asked Questions (FAQ)

– Mirror of Apache Flink

Related Posts

– On Apache Ignite v1.0. Interview with Nikita Ivanov. ODBMS Industry Watch, February 26, 2015

– AsterixDB: Better than Hadoop? Interview with Mike Carey. ODBMS Industry Watch, October 22, 2014

Common misconceptions about SQL on Hadoop.

SQL over Hadoop: Performance isn’t everything…

Getting Up to Speed on Hadoop and Big Data.


Follow on Twitter: @odbmsorg


Jun 9 15

Data for the Common Good. Interview with Andrea Powell

by Roberto V. Zicari

“CABI has a proud history (we were founded in 1910) of serving the needs of agricultural researchers around the world, and it is fascinating to see how technology can now help to achieve our development mission. We can have much greater impact at scale these days on the lives of poor farmers around the world (on whom we are all dependent for our food) by using modern technology and by putting knowledge into the hands of those who need it the most.”–Andrea Powell

I have interviewed Andrea Powell, Chief Information Officer at CABI.
Main topic of the interview is how to use data and knowledge for the Common Good, specifically by solving problems in agriculture and the environment.


Q1. What is the main mission of CABI?

Andrea Powell: CABI’s mission is to improve people’s lives and livelihoods by solving problems in agriculture and the environment.
CABI is a not-for-profit, intergovernmental organisation with over 500 staff based in 17 offices around the world. We focus primarily on plant health issues, helping smallholder farmers to lose less of what they grow and therefore to increase their yields and their incomes.

Q2. How effective is scientific publishing in helping the developing world solve agricultural problems?

Andrea Powell: Our role is to bridge the gap between research and practice.
Traditional scientific journals serve a number of purposes in the scholarly communication landscape, but they are often inaccessible or inappropriate for solving the problems of farmers in the developing world. While there are many excellent initiatives which provide free or very low-cost access to the research literature in these countries, what is often more effective is working with local partners to develop and implement local solutions which draw on and build upon that body of research.
Publishers have pioneered innovative uses of technology, such as mobile phones, to ensure that the right information is delivered to the right person in the right format.
This can only be done if the underlying information is properly categorised, indexed and stored, something that publishers have done for many decades, if not centuries. Increasingly we are able to extract extra value from original research content by text and data mining and by adding extra semantic concepts so that we can solve specific problems.

Q3. What are the typical real-world problems that you are trying to solve? Could you give us some examples of your donor-funded development programs?

Andrea Powell: In our Plantwise programme, we are working hard to reduce the crop losses that happen due to the effects of plant pests and diseases. Farmers can typically lose up to 40% of their crop in this way, so achieving just a 1% reduction in such losses could feed 25 million more hungry mouths around the world. Another initiative, called mNutrition, aims to deliver practical advice to farming families in the developing world about how to grow more nutritionally valuable crops, and is aimed at reducing child malnutrition and stunting.

Q4. How do you measure your impact and success?

Andrea Powell: We have a strong focus on Monitoring and Evaluation, and for each of our projects we include a “Theory of Change” which allows us to measure and monitor the impact of the work we are doing. In some cases, our donors carry out their own assessments of our projects and require us to demonstrate value for money in measurable ways.

Q5. What are the main challenges you are currently facing for ensuring CABI’s products and services are fit for purpose in the digital age?

Andrea Powell: The challenges vary considerably depending on the type of customer or beneficiary.
In our developed world markets, we already generate some 90% of our income from digital products, so the challenge there is keeping our products and platforms up-to-date and in tune with the way modern researchers and practitioners interact with digital content. In the developing world, the focus is much more on the use of mobile phone technology, so transforming our content into a format that makes it easy and cheap to deliver via this medium is a key challenge. Often this can take the form of a simple text message which needs to be translated into multiple languages and made highly relevant for the recipient.

Q6. You have one of the world’s largest agricultural databases, which sits in an RDBMS, and you also have info silos around the company. How do you pull all of this information together?

Andrea Powell: At the moment, with some difficulty! We do use APIs to enable us to consume content from a variety of sources in a single product and to render that content to our customers using a highly flexible Web Content Management System. However, we are in the process of transforming our current technology stack and replacing some of our Relational Databases with MarkLogic, to give us more flexibility and scalability. We are very excited about the potential this new approach offers.

Q7. How do you represent and model all of this knowledge? Could you give us an idea of how the data management part for your company is designed and implemented?

Andrea Powell: We have a highly structured taxonomy that enables us to classify and categorise all of our information in a consistent and meaningful way, and we have recently implemented a semantic enrichment toolkit, TEMIS Luxid®, to make this process even more efficient and automated. We are also planning to build a Knowledge Graph based on linked open data, which will allow us to define our domain even more richly and link our information assets (and those of other content producers) by defining the relationships between different concepts.

Q8. What kind of predictive analytics do you use or plan to use?

Andrea Powell: We are very excited by the prospect of being able to do predictive analysis on the spread of particular crop diseases or on the impact of invasive species. We have had some early investigations into how we can use semantics to achieve this; e.g. if pest A attacks crop B in country C, what is the likelihood of it attacking crop D in country E which has the same climate and soil types as country C?

Q9. How do you intend to implement such predictive analytics?

Andrea Powell: We plan to deploy a combination of expert subject knowledge, data mining techniques and clever programming!

Q10. What are future strategic developments?

Andrea Powell: Increasingly we are developing knowledge-based solutions that focus on solving specific problems and on fitting into user workflows, rather than creating large databases of content with no added analysis or insight. Mobile will become the primary delivery channel and we will also be seeking to use mobile technology to gather user data for further analysis and product development.

Qx. Anything else you wish to add?

Andrea Powell: CABI has a proud history (we were founded in 1910) of serving the needs of agricultural researchers around the world, and it is fascinating to see how technology can now help to achieve our development mission. We can have much greater impact at scale these days on the lives of poor farmers around the world (on whom we are all dependent for our food) by using modern technology and by putting knowledge into the hands of those who need it the most.

ANDREA POWELL, Chief Information Officer, CABI, United Kingdom.
I am a linguist by training (French and Russian) with an MA from Cambridge University but have worked in the information industry since graduating in 1988. After two and a half years with Reuters I joined CABI in the Marketing Department in 1991 and have worked here ever since. Since January 2015 I have held the position of Chief Information Officer, leading an integrated team of content specialists and technologists to ensure that all CABI’s digital and print publications are produced on time and to the quality standards expected by our customers worldwide. I am responsible for future strategic development, for overseeing the development of our technical infrastructure and data architecture, and for ensuring that appropriate information & communication technologies are implemented in support of CABI’s agricultural development programmes around the world.


– More information about how CABI is using MarkLogic can be found in this video, recorded at MarkLogic World San Francisco, April 2015.

Related Posts

Big Data for Good. ODBMS Industry Watch June 4, 2012. A distinguished panel of experts discuss how Big Data can be used to create Social Capital.

Follow on Twitter: @odbmsorg


Jun 2 15

Big Data and the financial services industry. Interview with Simon Garland

by Roberto V. Zicari

“The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.”–Simon Garland

The topic of my interview with Simon Garland, Chief Strategist at Kx Systems, is Big Data and the financial services industry.


Q1. Talking about the financial services industry, what types of data and what quantities are common?

Simon Garland: The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.

The data comes in through feed-handlers as streaming data. It is stored in-memory throughout the day and is appended to the on-disk historical database at the day’s end. Algorithmic trading decisions are made on a millisecond basis using this data. The associated risks are evaluated in real-time based on analytics that draw on intraday data that resides in-memory and historical data that resides on disk.
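The intraday/historical split described above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not how kdb+ is implemented; the class name `TickStore`, the single CSV file `history.csv`, and the row layout `(date, symbol, price)` are all assumptions made for this example.

```python
import csv
import os
import tempfile


class TickStore:
    """Toy store: today's ticks live in memory, older ticks on disk."""

    def __init__(self, root):
        self.history_path = os.path.join(root, "history.csv")
        self.intraday = []  # rows of (date, symbol, price)

    def on_tick(self, date, symbol, price):
        # Streaming data from a feed handler lands in memory.
        self.intraday.append((date, symbol, price))

    def end_of_day(self):
        # Append the day's ticks to the on-disk history, then reset.
        with open(self.history_path, "a", newline="") as f:
            csv.writer(f).writerows(self.intraday)
        self.intraday = []

    def prices(self, symbol):
        """Prices for a symbol across historical and intraday data."""
        out = []
        if os.path.exists(self.history_path):
            with open(self.history_path, newline="") as f:
                for _date, sym, price in csv.reader(f):
                    if sym == symbol:
                        out.append(float(price))
        out.extend(p for _d, s, p in self.intraday if s == symbol)
        return out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        store = TickStore(root)
        store.on_tick("2015-06-01", "IBM", 100.0)
        store.end_of_day()
        store.on_tick("2015-06-02", "IBM", 101.5)
        print(store.prices("IBM"))
```

The point of the sketch is that a single query (`prices`) spans both the on-disk history and the in-memory day, so analytics do not need to know where a given record currently resides.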

Q2. What are the most difficult data management requirements for high performance financial trading and risk management applications?

Simon Garland: There has been a decade-long arms race on Wall Street to achieve trading speeds that get faster every year. Global financial institutions in particular have spent heavily on high performance software products, as well as IT personnel and infrastructure just to stay competitive. Traders require accuracy, stability and security at the same time that they want to run lightning fast algorithms that draw on terabytes of historical data.

Traditional databases cannot perform at these levels. Column store databases are generally recognized to be orders of magnitude faster than regular RDBMS; and a time-series optimized columnar database is uniquely suited for delivering the performance and flexibility required by Wall Street.

Q3. And why is this important for businesses?

Simon Garland: Orders of magnitude improvements in performance will open up new possibilities for “what-if” style analytics and visualization; speeding up their pace of innovation, their awareness of real-time risks and their responsiveness to their customers.

The Internet of Things in particular is important to businesses who can now capitalize on the digitized time-series data they collect, like from smart meters and smart grids. In fact, I believe that this is only the beginning of the data volumes we will have to be handling in the years to come. We will be able to combine this information with valuable data that businesses have been collecting for decades.

Q4. One of the promises of Big Data for many businesses is the ability to effectively use both streaming data and the vast amounts of historical data that will accumulate over the years, as well as the data a business may already have warehoused, but never has been able to use. What are the main challenges and the opportunities here?

Simon Garland: This can seem like a challenge for people trying to put a system together from a streaming database, an in-memory database from a different vendor, and a historical database from yet another vendor. They then pull data from all of these applications into yet another programming environment. This method cannot deliver the required performance, and in the long term it is fragile and unmaintainable.

The opportunity here is for a database platform that unifies the software stack, like kdb+, that is robust, easily scalable and easily maintainable.

Q5. How difficult is it to combine and process streaming, in-memory and historical data in real time analytics at scale?

Simon Garland: This is an important question. These functionalities can’t be added afterwards. Kdb+ was designed for streaming data, in-memory data and historical data from the beginning. It was also designed with multi-core and multi-process support from the beginning which is essential for processing large amounts of historical data in parallel on current hardware.

We were doing this for decades, even before multi-core machines existed — which is why Wall Street was an early adopter of our technology.

Q6. q programming language vs. SQL: could you please explain the main differences? And also highlight the Pros and cons of each.

Simon Garland: The q programming language is built into the database system kdb+. It is an array programming language that inherently supports the concepts of vectors and column store databases rather than the rows and records that traditional SQL supports.

The main difference is that traditional SQL doesn’t have a concept of order built in, whereas the q programming language does. This makes complete sense when dealing with time-series data.
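To see why a built-in notion of order matters for time series, consider two typical questions: "how much did the price move from one trade to the next?" and "what was the latest quote at or before this trade?". Both only make sense on ordered data. The sketch below is plain Python, not q, and the helper names `price_changes` and `asof` are chosen for this example only.

```python
from bisect import bisect_right


def price_changes(prices):
    """Change between each price and the one immediately before it."""
    return [b - a for a, b in zip(prices, prices[1:])]


def asof(quote_times, quote_values, t):
    """Latest quote value at or before time t, or None if there is none.

    quote_times must be sorted in ascending order.
    """
    i = bisect_right(quote_times, t)
    return quote_values[i - 1] if i else None


if __name__ == "__main__":
    print(price_changes([10, 12, 11]))
    print(asof([1, 5, 9], [100, 101, 102], 6))
```

On an ordered vector these are a single pass or a binary search. In a language without inherent row order, the same questions first require stating an explicit ordering for every query.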

Q is intuitive and the syntax is extremely concise, which leads to more productivity, less maintenance and quicker turn-around time.

Q7. Could you give us some examples of successful Big Data real time analytics projects you have been working on?

Simon Garland: Utility applications are using kdb+ for millisecond queries of tables with hundreds of billions of data points captured from millions of smart meters. Analytics on this data can be used for balancing power generation, managing blackouts and for billing and maintenance.

Internet companies with massive amounts of traffic are using kdb+ to analyze Googlebot behavior to learn how to modify pages to improve their ranking. They tell us that traditional databases simply won’t work when they have 100 million pages receiving hundreds of millions of hits per day.

In industries like pharmaceuticals, where decision-making is based on data that can be one day, one week or one month old, our customers and prospects say our column store database makes their legacy data warehouse software obsolete. It is many times faster on the same queries. The time needed for complex analyses on extremely large tables has literally been reduced from hours to seconds.

Q8. Are there any similarities in the way large data sets are used in different vertical markets such as financial service, energy & pharmaceuticals?

Simon Garland: The shared feature is that all of our customers have structured, time-series data. The scale of their data problems is completely different, as are their business use cases. The financial services industry, where kdb+ is an industry standard, demands constant improvements to real-time analytics.

Other industries, like pharma, telecom, oil and gas and utilities, have a different concept of time. They also often are working with smaller data extracts, which they often still consider “Big Data.” When data comes in one day, one week or one month after an event occurred, there is not the same sense of real-time decision making as in finance. Having faster results for complex analytics helps all industries innovate and become more responsive to their customers.

Q9. Anything else you wish to add?

Simon Garland: If we piqued your interest, we have a free, 32-bit version of kdb+ available for download on our web site.

Simon Garland, Chief Strategist, Kx Systems
Simon is responsible for upholding Kx’s high standards for technical excellence and customer responsiveness. He also manages Kx’s participation in the Securities Trading Analysis Center, overseeing all third-party benchmarking.
Prior to joining Kx in 2002, Simon worked at a database search engine company.
Before that he worked at Credit Suisse in risk management. Simon has developed software using kdb+ and q, going back to when the original k and kdb were introduced. Simon received his degree in Mathematics from the University of London and is currently based in Europe.


LINK to Download of the free 32-bit version of kdb+

Q Tips: Fast, Scalable and Maintainable Kdb+, Author: Nick Psaris

Related Posts

Big Data and Procurement. Interview with Shobhit Chugh. Source: ODBMS Industry Watch, Published on 2015-05-19

On Big Data and the Internet of Things. Interview with Bill Franks. Source: ODBMS Industry Watch, Published on 2015-03-09

On MarkLogic 8. Interview with Stephen Buxton. Source: ODBMS Industry Watch, Published on 2015-02-13

Follow on Twitter: @odbmsorg

May 19 15

Big Data and Procurement. Interview with Shobhit Chugh

by Roberto V. Zicari

“The future of procurement lies in optimising cost and managing risk across the entire supplier base; not just the larger suppliers. Easy access to a complete view of supplier relationships across the enterprise will help those responsible for procurement to make favorable decisions, eliminate waste, increase negotiating leverage and manage risk better.”–Shobhit Chugh.

Data Curation, Big Data and the challenges and the future of Procurement/Supply Chain Management are among the topics of the interview with Shobhit Chugh, Product Marketing Lead at Tamr, Inc.


Q1. In your opinion, what is the future of Procurement/Supply Chain Management?

Shobhit Chugh: Procurement spend is one of the largest spend items for most companies; and supplier risk is one of the items that keeps CEOs of manufacturing companies up at night. Just recently, for example, an issue with a haptic device supplier created a shortage of Apple Watches just after the product’s launch.

At the same time, the world is changing: more data sources are available with increasing variety, and that keeps changing with frequent mergers and acquisitions. The future of procurement lies in optimizing cost and managing risk across the entire supplier base; not just the larger suppliers. Easy access to a complete view of supplier relationships across the enterprise will help those responsible for procurement to make favorable decisions, eliminate waste, increase negotiating leverage and manage risk better.

Q2. What are the current key challenges for Procurement/Supply Chain Management?

Shobhit Chugh: Companies looking for efficiency in their supply chains are limited by the siloed nature of procurement. The domain knowledge needed to properly evaluate suppliers typically resides deep in business units and suppliers are managed at ground level, preventing organizations from taking a global view of suppliers across the enterprise. Those people selecting and managing vendors want to drive terms that favor their company, but don’t have reliable cross-enterprise information on suppliers to make those decisions, and the cost of organizing and analyzing the data has been prohibitive.

Q3. What is the impact of Big Data on the Procurement/Supply Chain?

Shobhit Chugh: A brute force, manual effort to get a single view of suppliers on items such as terms, prices, risk metrics, quality, performance, etc. has traditionally been nearly impossible to do cost effectively. Even if the data exists within the organization, data challenges make it hard to consolidate information into a single view across business units. Rule-based approaches for unifying this data have scale limitations and are difficult to enforce given the distributed nature of procurement. And this does not even include the variety of external data sources that companies can take advantage of, which further increases the potential impact of big data.

Big data changes the situation by providing the ability to evaluate supplier contracts and performance in real time, and puts that intelligence in the hands of people working with suppliers so they can make better decisions. Big data holds significant promise, but only when data unification brings the decentralized data and expertise together to serve the greater good.

Q4. Why does this challenge call for data unification?

Shobhit Chugh: The quality of analysis coming out of procurement optimization is directly related to the volume and quality of data going in. Bringing that data together is no minor feat. In our experience, any individual in an organization can effectively use no more than ten percent of the organization’s data even under very good conditions. Given the distributed nature of procurement, that figure is likely dramatically lower in this situation. Cataloging the hundreds or thousands of internal and external data sources related to procurement provides the foundation for improved decision making.

Similarly, the ability to compare data is directly correlated to the ability to match data points in the same category or related to the same supplier. This is where top-down approaches often get bogged down. Part names, supplier names, site IDs and other data attributes need to be normalized and organized. The efficiency of big data is severely limited if like data sets in various formats aren’t brought together for meaningful comparison.
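To make the normalization step concrete, here is a minimal Python sketch of how supplier names in different formats might be reduced to a common key before comparison. The suffix list and the function names are illustrative assumptions, not part of any Tamr product, and real data would need far richer rules.

```python
import re
import unicodedata

# Hypothetical list of legal-form tokens to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co",
                  "company", "ltd", "limited", "llc", "gmbh"}


def normalize_supplier_name(raw: str) -> str:
    """Return a canonical comparison key for a supplier name."""
    # Decompose accented characters and drop the combining marks,
    # so that "Müller" and "Muller" produce the same key.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, then turn every run of non-alphanumerics into one space.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    # Drop legal-form tokens such as "inc" or "gmbh".
    tokens = [t for t in text.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def group_by_normalized_name(names):
    """Group raw names that share the same normalized key."""
    groups = {}
    for name in names:
        groups.setdefault(normalize_supplier_name(name), []).append(name)
    return groups
```

Exact-key grouping like this only catches formatting differences; misspellings and abbreviations need approximate matching on top of it.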

Q5. How is data unification related to Procurement/Supply Chain Management?

Shobhit Chugh: There are several ways for highly trained data scientists to combine a handful of sources for analysis. Procurement optimization across all suppliers is a markedly different challenge. Procurement data for a company could reside in dozens to thousands of places with very little similarity with regard to how the data is organized. Not only is this data hard for a centralized resource to find and collect, it is hard for non-experts to properly organize and prepare for analysis.
This data must be curated so that analysis returns meaningful results.

One thing I want to emphasize is that data unification is an ongoing activity rather than a one-time integration task. Companies that recognize this continue to extract the maximum value out of data, and are also able to bring in more internal and external data sources when the opportunity presents itself.

Q6. Can you put that in the context of a real world example?

Shobhit Chugh: A highly diversified manufacturer we work with wanted a single view of suppliers across numerous information silos spanning multiple business units. A supplier master list would ultimately contain over a hundred thousand supplier records from many ERP systems. Just one business unit was maintaining over a dozen ERP systems, with new ERP systems regularly coming on line or being added through acquisitions. The list of suppliers also changed rapidly, making functions like deduplication nearly impossible to maintain. Additionally, the company wanted to integrate external data to enrich internal data with information on each supplier’s fiscal strength and structure.

A “bottom-up,” probabilistic approach to data integration proved to be more scalable than a traditional “top-down” manual approach, due to the sheer volume and variety of data sources. Specifically, the company leveraged our machine learning algorithms to continuously re-evaluate and remove potential duplicate entries, driving automation supported by expert guidance into a previously manual process performed by non-experts. The initial result was elimination of 33 percent of suppliers from the master list, just through deduplication.
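The general shape of score-based deduplication with expert review can be sketched in a few lines of Python. This is not Tamr's algorithm: it swaps in the standard library's `difflib.SequenceMatcher` as a stand-in similarity score, and the two thresholds are arbitrary assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage_pairs(records, auto_merge=0.9, ask_expert=0.7):
    """Split candidate pairs into confident duplicates and uncertain ones.

    Pairs scoring at or above `auto_merge` are treated as duplicates.
    Pairs between the two thresholds are queued for a human expert.
    Both thresholds are illustrative assumptions.
    """
    merged, review = [], []
    for a, b in combinations(records, 2):
        score = similarity(a, b)
        if score >= auto_merge:
            merged.append((a, b, score))
        elif score >= ask_expert:
            review.append((a, b, score))
    return merged, review


suppliers = ["Acme Industrial Supply", "ACME Industrial Suply",
             "Acme Industrial", "Zenith Optics"]
merged, review = triage_pairs(suppliers)
```

Comparing every pair is quadratic in the number of records, so a list of a hundred thousand suppliers would need blocking or indexing first; the sketch only shows the scoring and triage idea.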

The company then looked across multiple businesses’ governance systems for suppliers that were related through a corporate structure and identified a significant overlap. Using the same core master list, operational teams were able to treat supplier subsidiaries as different entities for payment purposes, while analytics teams got a global view of a supplier to ensure consistent payment terms. From hundreds of single-use sources, the company created a single view of suppliers with multiple important uses.

Q7. When you talk about data curation, who is doing the curation and for whom? Is it centralized?

Shobhit Chugh: Everyone responsible for a supplier relationship, and the corresponding data, has an interest in the completeness of the data pool, and an interest in the most complete analysis possible. They don’t have an interest in committing the time required to unify the data manually. Our approach is to use ever-improving machine learning to handle the bulk of data matching and rely on subject matter experts only when needed. Further, the system learns which experts to ask each time help is needed, depending on the situation. Once the data is unified, it is available for use by all, including data scientists and corporate leaders far removed from the front lines.

Q8. Do all data-enabled organizations need to hire the best data scientists they can find?

Shobhit Chugh: Yes, data-driven companies should create data-driven innovation, and non-obvious insights often take good data scientists who are tasked with looking beyond the next supplier for ways data can impact other areas of the business. Here, too, the decentralized model of data unification has dramatic benefits.

The current scarcity of qualified data scientists will only deepen as the growth in demand is expected to far outpace the rate of qualified professionals entering the field. Everyone is looking to hire the best and brightest data scientists to get insights from their data, but relentless hiring is the wrong way to solve the problem. Data scientists spend 80 percent of their time finding and preparing data, and only 20 percent actually finding answers to critical business questions. Therefore, the better path to scaling data scientists is enabling the ones you have to spend more time on analysis rather than data preparation.

Q9. What is the ROI a company could expect from using data unification for procurement?

Shobhit Chugh: Procurement is an exciting area for data unification precisely because once data is unified, value can be derived using existing best practices, now with a much larger percentage of the supplier base.
Value includes better payment terms, cost savings, higher raw material and part quality and lower supplier risk.
Seventy-five to 80 percent of the value of procurement optimization strategies will come from smaller suppliers and contracts, and data unification unlocks this value.

Q10. What do you predict will be the top five challenges for procurement to tackle in the next two years?

Shobhit Chugh: Using data unification and powerful analysis tools, companies will begin to see immediate value from:
• Achieving “most favored” status from suppliers and eliminating poorly structured contracts where suppliers have multiple customers in your organization
• Building holistic relationships with supplier parent organizations based on the full scope of their subsidiaries’ commitments
• Eliminating rules-based approaches to supplier sourcing and other top-down strategies in favor of data-driven, bottom-up strategies that make use of expertise and data spread throughout the organization
• Embracing the variety of pressure points in procurement – price, delivery, quality, minimums, payment terms, risk, etc. – as ways to customize vendor relationships to suit each need rather than a fog that obscures the value of each contract
• Identifying the internal procurement “rock stars” and winning strategies that drive the most value for your organization and replicating those ideas enterprise-wide

Qx. Anything else you wish to add?

Shobhit Chugh: The final component we haven’t discussed is the timing associated with these gains.
We’ve seen procurement optimization projects performed in days or weeks that unleash the vast untapped majority of data locked in previously unknown sources. Not long ago, similar projects focused on just the top suppliers took months and quarters. Addressing the full spectrum of suppliers in this way was not feasible. The combination of data unification and big data is perfectly suited to bringing value quickly and sustaining that value by staying on top of the continual tide of new data.

Shobhit Chugh leads product marketing for Tamr, which empowers organizations to leverage all of their data for analytics by automating the cataloging, connection and curation of “hard-to-reach” data with human-guided machine learning. He has spent his career in tech startups including High Start Group, Lattice Engines, Adaptly and Manhattan Associates. He has also worked as a consultant at McKinsey & Company’s Boston and New York offices, where he advised high tech and financial services clients on technology and sales and marketing strategy.
Shobhit holds an MBA from Kellogg School of Management, a Master’s of Engineering Management in Design from McCormick School of Engineering at Northwestern University, and a Bachelor of Technology in Computer Science from Indian Institute of Technology, Delhi.


Procurement: Fueling optimization through a simplified, unified view, White Paper, Tamr (registration required)

Data Curation at Scale: The Data Tamer System (PDF)

Can We Trust Probabilistic Machines to Prepare Our Data? By Daniel Bruckner

Smooth Sailing for Data Lakes: Hadoop/Hive + Data Curation, Tamr


Related Posts

On Data Curation. Interview with Andy Palmer, ODBMS Industry Watch, January 14, 2015

On Data Mining and Data Science. Interview with Charu Aggarwal. ODBMS Industry Watch Published on 2015-05-12 Experts Notes
Selected contributions from experts panel:
Big data, big trouble.
Data Acceleration Architecture/ Agile Analytics.
Critical Success Factors for Analytical Models.
Some Recent Research Insights Operations Research as a Data Science Problem.
Data Wisdom for Data Science.

Follow on Twitter: @odbmsorg


May 12 15

On Data Mining and Data Science. Interview with Charu Aggarwal

by Roberto V. Zicari

“What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging” — Charu Aggarwal.

On Data Mining, Data Science and Big Data, I have interviewed Charu Aggarwal, Research Scientist at the IBM T. J. Watson Research Center, an expert in this area.


Q1. You recently edited two books: Data Classification: Algorithms and Applications and Data Clustering: Algorithms and Applications.
What are the main lessons learned in data classification and data clustering that you can share with us?

Charu Aggarwal: The most important lesson, which is perhaps true for all of data mining applications, is that feature extraction, selection and representation are extremely important. It is all too often that we ignore these important aspects of the data mining process.

Q2. How do data classification and data clustering relate to each other?

Charu Aggarwal: Data classification is the supervised version of data clustering. Data clustering is about dividing the data into groups of similar points. In data classification, examples of groups of points are made available to you. Then, for a given test instance, you are supposed to predict which group this point might belong to.
In the latter case, the groups often have a semantic interpretation. For example, the groups might correspond to fraud/not fraud labels in a credit-card application. In many cases, it is natural for the groups in classification to be clustered as well. However, this is not always the case.
Some methods such as semi-supervised clustering/classification leverage the natural connections between these problems to provide better quality results.
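The contrast between the two problems can be shown with a small Python sketch. It assumes k-means as the clustering method and a nearest-centroid rule as the classifier; both are swapped in purely for illustration and are not methods named in the interview. The function names and the toy data in the usage note are hypothetical.

```python
from statistics import mean


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(mean(axis) for axis in zip(*points))


def kmeans(points, initial_centers, iterations=10):
    """Unsupervised: group points using only their positions."""
    centers = list(initial_centers)
    for _ in range(iterations):
        assignment = [min(range(len(centers)),
                          key=lambda k: squared_distance(p, centers[k]))
                      for p in points]
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster goes empty
                centers[k] = centroid(members)
    return [min(range(len(centers)),
                key=lambda k: squared_distance(p, centers[k]))
            for p in points]


def fit_nearest_centroid(points, labels):
    """Supervised: the groups are given as labels; learn one centroid each."""
    by_label = {}
    for p, y in zip(points, labels):
        by_label.setdefault(y, []).append(p)
    return {y: centroid(ps) for y, ps in by_label.items()}


def predict(model, point):
    """Assign a test instance to the label with the closest centroid."""
    return min(model, key=lambda y: squared_distance(point, model[y]))
```

With six points forming two tight groups, `kmeans` returns anonymous group indices, while `fit_nearest_centroid` followed by `predict` returns a meaningful label such as "fraud" for a new point, which is the semantic interpretation mentioned above.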

Q3. Can data classification and data clustering be useful also for large data sets and data streams? If yes, how?

Charu Aggarwal: Data clustering is definitely useful for large data sets, because clusters can be viewed as summaries of the data. In fact, a particular form of fine-grained clustering, referred to as micro-clustering, is commonly used for summarizing high-volume streaming data in real time. These summaries are then used for many different applications, such as first-story detection, novelty detection, prediction, and so on.
In this sense, clustering plays an intermediate role in enabling other applications for large data sets.
Classification can also be used to generate different types of summary information, although it is a little less common. The reason is that classification is often used as the end-user application, rather than as an intermediate application like clustering. Therefore, big-data serves as a challenge and as an opportunity for classification.
It serves as a challenge because of obvious computational reasons. It serves as an opportunity because you can build more complex and accurate models with larger data sets without creating a situation where the model inadvertently overfits to the random noise in the data.
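The micro-clustering idea mentioned above can be sketched as follows. Each micro-cluster keeps only additive statistics (a count, per-dimension sums and sums of squares), so a stream can be summarized in one pass without storing the points. The fixed-radius absorption rule and the class and function names are simplifying assumptions; published stream clustering algorithms use more careful maintenance rules.

```python
import math


class MicroCluster:
    """Additive summary of the points absorbed so far."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.square_sum = [x * x for x in point]

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        # Per-dimension variance recovered from the stored sums alone.
        return [sq / self.n - (ls / self.n) ** 2
                for sq, ls in zip(self.square_sum, self.linear_sum)]

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x


def summarize(stream, radius):
    """One pass over the stream; `radius` is an assumed tuning parameter."""
    clusters = []
    for point in stream:
        best = min(clusters,
                   key=lambda c: math.dist(c.centroid(), point),
                   default=None)
        if best is not None and math.dist(best.centroid(), point) <= radius:
            best.absorb(point)
        else:
            clusters.append(MicroCluster(point))
    return clusters
```

Because the statistics are additive, two micro-clusters can be merged by adding their fields, which is what makes such summaries convenient as an intermediate step for other applications.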

Q4. How do you typically extract “information” from Big Data?

Charu Aggarwal: This is a highly application-specific question, and it really depends on what you are looking for. For example, for the same stream of health-care data, you might be looking for different types of information, depending on whether you are trying to detect fraud, or whether you are trying to discover clinical anomalies. At the end of the day, the role of the domain expert can never be discounted.
However, the common theme in all these cases is to create a more compressed, concise, and clean representation into one of the data types we all recognize and know how to process. Of course, this step is required in all data mining applications, and not just big data applications. What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging.
For example, if you look at Google’s original MapReduce framework, it was motivated by a need to efficiently perform operations that are almost trivial for smaller data sets, but suddenly become very expensive in the big-data setting.
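Word counting is the usual example of an operation that is trivial on one machine but needs the map, shuffle and reduce structure once the data is spread across many. The Python sketch below runs all three phases in a single process to show the data flow only; the function names are illustrative and no distributed framework is involved.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: bring all values for the same key together."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # In a real cluster each chunk would be mapped on a different machine.
    return reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))
```

In a real deployment the expensive part is the shuffle, since it moves data between machines; that cost is absent from this in-memory version.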

Q5. What are the typical problems and scenarios when you cluster multimedia, text, biological, categorical, network, streams, and uncertain data?

Charu Aggarwal: The heterogeneity of the data types causes significant challenges.
One problem is that the different data types may often be mixed, as a result of which the existing methods can sometimes not be used directly. Some common scenarios in which such data types arise are photo/music/video-sharing (multimedia), healthcare (time-series streams and biological), and social networks. Among these different data types, the probabilistic (uncertain) data type does not seem to have graduated from academia into industry very well. Of course, it is a new area and there is a lot of active research going on. The picture will become clearer in a few years.

Q6. How effective are today’s clustering algorithms?

Charu Aggarwal: Clustering methods have become increasingly effective in recent years because of advances in high-dimensional methods. In the past, when the data was very high-dimensional, most existing methods worked poorly because of locally irrelevant attributes and concentration effects. These are collectively referred to as the curse of dimensionality. Techniques such as subspace and projected clustering have been introduced to discover clusters in lower dimensional views of the data. One nice aspect of this approach is that some variations of it are highly interpretable.
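The concentration effect can be observed numerically. The Python sketch below draws uniformly random points and measures how much farther the farthest point is from a query than the nearest one, relative to the nearest distance. Uniform random data, the sample size and the function name are assumptions made for illustration; real data sets behave less uniformly.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


low = relative_contrast(2)     # two dimensions
high = relative_contrast(500)  # five hundred dimensions
```

In this setup the contrast in 500 dimensions comes out far smaller than in two, meaning that "nearest" and "farthest" become hard to tell apart, which is one reason distance-based clustering degrades on the full set of dimensions.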

Q7. What is in common between pattern recognition, database analytics, data mining, and machine learning?

Charu Aggarwal: They really do the same thing, which is that of analyzing and gleaning insights from data. It is just that the styles and emphases are different in various communities. Database folks are more concerned about scalability. Pattern recognition and machine learning folks are somewhat more theoretical. The statistical folks tend to use their statistical models. The data mining community is the most recent one, and it was formed to create a common meeting ground for these diverse communities.
The first KDD conference was held in 1995, and we have come a long way since then towards integration. I believe that the KDD conference has played a very major role in the amalgamation of these communities. Today, it is actually possible for the folks from database and machine learning communities to be aware of each other’s work. This was not quite true 20 years ago.

Q8. What are the most “precise” methods in data classification?

Charu Aggarwal: I am sure that you will find experts who are willing to swear by a particular model. However, each model comes with a different set of advantages over different data sets. Furthermore, some models, such as univariate decision trees and rule-based methods, have the advantage of being interpretable even when they are outperformed by other methods. After all, analysts love to know about the “why” aside from the “what.”

While I cannot say which models are the most accurate (highly data specific), I can certainly point to the most “popular” ones today from a research point of view. I would say that SVMs and neural networks (deep learning) are the most popular classification methods. However, my personal experience has been mixed.
While I have found SVMs to work quite well across a wide variety of settings, neural networks are generally less robust. They can easily overfit to noise or show unstable performance over small ranges of parameters. I am watching the debate over deep learning with some interest to see how it plays out.

Q9. When should one use Mahout for classification, and what is the advantage of doing so?

Charu Aggarwal: Apache Mahout is a scalable machine learning environment for data mining applications. One distinguishing feature of Apache Mahout is that it builds on top of distributed infrastructures like MapReduce, and enables easy building of machine learning applications. It includes libraries of various operations and applications.
Therefore, it reduces the effort of the end user beyond the basic MapReduce framework. It should be used in cases where the data is large enough to require the use of such distributed infrastructures.

Q10. What are your favourite success stories in Data Classification and/or Data Clustering?

Charu Aggarwal: One of my favorite success stories is in the field of high dimensional data, where I explored the effect of locally irrelevant dimensions and concentration effects on various data mining algorithms.
I designed a suite of algorithms for such high-dimensional tasks as clustering, similarity search, and outlier detection.
The algorithms continue to be relevant even today, and we have even generalized some of these results to big-data (streaming) scenarios and other application domains, such as the graph and text domains.

Qx. Anything else you wish to add?

Charu Aggarwal: Data mining and data sciences are at exciting cross-roads today. I have been working in this field since 1995, and I have never seen as much excitement about data science in my first 15 years, as I have seen in the last 5. This is truly quite amazing!

Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in Yorktown Heights, New York.
He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996.
His research interest during his Ph.D. years was in combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James B. Orlin.
He has since worked in the field of performance analysis, databases, and data mining. He has published over 200 papers in refereed conferences and journals, and has applied for or been granted over 80 patents. He is author or editor of nine books.
Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research.
He has served on the program committees of most major database/data mining conferences, and served as program vice-chair of the SIAM Conference on Data Mining, 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an associate editor of the ACM TKDD Journal, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal.
He is a fellow of the IEEE for “contributions to knowledge discovery and data mining techniques”, and a life-member of the ACM.



Data Classification: Algorithms and Applications, Editor: Charu C. Aggarwal, Publisher: CRC Press/Taylor & Francis Group, 978-1-4665-8674-1, © 2014, 707 pages

Data Clustering: Algorithms and Applications, Edited by Charu C. Aggarwal, Chandan K. Reddy, August 21, 2013 by Chapman and Hall/CRC

MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat. Appeared in: OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. (PDF)
