“Our evaluation of IMDSs determined that eXtremeDB-64 IMDS outperformed other IMDSs in terms of performance and scalability.”–Manjul Maharishi.
I have interviewed Manjul Maharishi, Vice President (telecom software development) at Transaction Network Services.
They use In-Memory Database technology for managing real-time community networks in the world.
Q1. What is the mission of Transaction Network Services (TNS)?
Manjul Maharishi: Transaction Network Services manages many of the largest real-time community networks in the world, enabling industry participants to simply and securely interact and transact with other businesses, to access the data and applications they need, over managed and secure communications platforms. TNS’ existing footprint supports millions of connections and access to critical databases, enabling its customers through a single connection, a “one-to-many and many-to-many” global platform, securely blending private and public networking.
Q2. What is TNS’s Carrier ENUM Registry? And for what is it useful for?
Manjul Maharishi: Carrier ENUM Registry is a product offering for telecom carriers that provides information critical to the accurate routing and billing of inter-carrier communications, such as voice and mobile data services.
Carrier ENUM Registry addresses a challenge that is posed every time you place a phone call or send a text message: how, in the split second of latency that is deemed acceptable, will the call or message find the way to its recipient?
As a solution, Carrier ENUM Registry makes available an up-to-date, portability-corrected image of the entire public dial plan as well as authoritative information sourced directly from the service provider that “owns” (in telecom parlance, has the “right-to-use”) a particular telephone number. This is provided in the form of two registries, or databases:
• Number Identity Registry is a massive repository of global telephone numbers and carrier-of-record information that identifies which service provider a telephone number was allocated to for end-user assignment. In response to lookups the registry returns a Carrier Identifier (which can be in the form of a Service Provider Identifier (SPID), and/or a Mobile Country Code+Mobile Network Code(MCC+MNC)) and when available, the Location Routing Number (LRN) of ported and pooled numbers.
• Network Routing Directory is a multi-party shared registration system that furnishes service providers with sophisticated data-sharing capabilities featuring safeguard controls designed to uphold data-sharing policies. Using our secure portal service, providers self-administer data and selectively grant access, in whole or part, to trading partners (and vice versa).
Q3. Who are the customers using Carrier ENUM Registry?
Manjul Maharishi: The customers are telecom carriers – large, small and in-between – worldwide.
They are mobile, landline and IP-based, including some ISPs (Internet Service Providers) and cable MSOs (multiple system operators) that offer phone service, as well as “pure” VoIP providers. Most query the Carrier ENUM Registry deployment hosted at TNS’ facility in the US, but some host the application and database on their own premises.
Q4. What services do you support?
Manjul Maharishi: Carrier ENUM emerged as a service to connect the public switched telephone network (PSTN) and new IP-based networks, by resolving phone numbers to IP addresses and services. It also provided a bridge between IP-based carriers. For example, with a multi-vendor database of routes, users and phone numbers available, a caller on IP-based Network A could communicate with a user of IP-based Network B without routing calls across the PSTN (which would incur costs and may require avoidable transcoding).
Over time, though, Carrier ENUM Registry has gained complexity along with new features, and does much more than bridging between carriers. Supported services now include number portability, IP-peering between telephone service providers, SMS/MMS (aka “text messages”) routing, unbundling of services (allowing messaging to be offered separately from voice, for example), customized views of data, routing based on time/destination/origination, and more. These services have added complexity to TNS’ Carrier ENUM Registry business logic and have caused its databases to grow larger and the routing logic to become more complex.
Carriers can pick and choose from the various Carrier ENUM Registry features, to solve their particular challenges.
One of the biggest use cases in demand now is identifying the right carrier to terminate an SMS or MMS when number portability is involved in the host country.
Q5. What kind of real-time performance demands does Carrier ENUM Registry need to satisfy?
Manjul Maharishi: To customers, we commit to providing a response from our system within 10 milliseconds for 95% of the queries. Please note that a single customer query can result in dozens to a couple of hundred individual table queries based on the routing logic and services subscribed. However, largely through the use of in-memory database system (IMDS) technology for data management, we have been able to have a much lower variance in the query responses and a higher degree of predictability. Our typical average response to a customer query is less than 2 msec. These numbers only reflect the latency introduced by our platform, i.e. the time difference between when we receive the query and when we respond back.
The network latency – the time when the query leaves the customer network and when they receive the response – is larger (typical US cross-country network latency is 60-100 msec). An industry norm for the maximum acceptable time from when a subscriber dials digits to when they hear a ringing tone back is ~150-200 msecs, beyond which the “dead air/silence” becomes noticeable for the subscriber. However, for international calls, people do tend to be more tolerant of such post-dial delays.
Q6. Can you give an overview of the system architecture and toolset used to handle the increasingly complex business logic and growing data volume?
Manjul Maharishi: In order to handle the growing amount of stored data, we use general-purpose off-the-shelf Linux servers. This allows us to take advantage of industry-wide gains in processing power, memory and performance, as well as eliminate any dependence upon specific vendors for a software/hardware upgrade cycle. Currently, the systems are running on dual CPU, 6- and 8-core processors.
For data management we use eXtremeDB-64, the 64-bit edition of McObject’s eXtremeDB In-Memory Database System (IMDS). The system is architected such that each server stores the entire database, and customer queries are load balanced across a set of such servers. Accordingly, the platform is easily scaled by adding new servers as needed. Apart from offering this service as a cloud-based offering (“Central Replica”), we also offer the service as customer-premise deployment model (“Local Replica”) whereby the customers can gain from a much lower round-trip time (RTT) by avoiding network-latency. The TNS network operations center (NOC) monitors key performance indicators of our Central and Local Replica servers on a 24×7 basis, and we have agreements in place with our customers to scale up the platform by adding more servers if needed.
With the performance provided by eXtremeDB-64, we haven’t had a need to partition the data set in order to meet our commitments. We do use the database system’s Patricia trie indices to reduce the number of lookups required on certain tables, and work through the business logic to narrow down the search results to a manageable number early in the business rules processing.
In terms of development tools, we are developing in C++ using eXtremeDB’s C/C++ API instead of accessing via an SQL API, and this contributes to lower application latency. We develop software using Agile methodology with Continuous Integration that has nightly builds with a suite of automated tests executing during these builds. We also incorporate code coverage, leak detection and profiling as part of this Continuous Integration.
Q7. Can you tell more about how Carrier ENUM Registry meets its real-time data access requirements? Did it move to in-memory database technology recently or has this always been a feature?
Manjul Maharishi: The system architecture keeps data needed for real-time queries in memory, where it can be accessed quickly. Early versions of Carrier ENUM Registry accomplished searches using in-memory database code developed in-house for the application. However, TNS recognized several years ago that with the increasingly complex queries and higher data volumes, Carrier ENUM Registry would be better-served by an off-the-shelf in-memory database system (IMDS) that provides flexibility while scaling to hundreds of millions and even billions of records. After researching IMDSs, we chose the 64-bit eXtremeDB-64 and the new Carrier ENUM Registry version incorporating eXtremeDB-64 launched in 2013.
Currently, the system holds a master or archival data set in Oracle Enterprise DBMSs, with the data used for real-time lookups hosted “downstream” in eXtremeDB-64. Each downstream server hosts the entire data set used by the application; this data set consists of three separate (i.e. with unique schemas and data) databases with a combined size of 120 GB.
Two of the databases managed by eXtremeDB-64 on each server are “pure” in-memory databases while the third utilizes McObject’s eXtremeDB Fusion technology to include some persistent (on-disk) data storage.
Q8. Why did you choose eXtremeDB-64 from the field of available IMDSs and what has been your experience been using it?
Manjul Maharishi: Our evaluation of IMDSs determined that eXtremeDB-64 IMDS outperformed other IMDSs in terms of performance and scalability. Among other test findings, TNS determined that eXtremeDB’s performance exceeded 2 million queries per second with a 10 million-row database. When TNS upped the challenge by increasing the test database size 3000% (to 300 million records), eXtremeDB’s responsiveness fell only minimally, validating the near-linear scalability results documented in McObject’s published benchmarks. TNS’ platform for these tests consisted of Intel Xeon X5570 2.93 GHz hardware, with 8 cores and hyper-threading enabled, running Red Had Enterprise Linux 4, with 72 GB RAM.
Using eXtremeDB-64 in production with Carrier ENUM Registry has borne out our expectations: the database system meets current needs while providing room for future growth in both database size and complexity of application features.
Q9. You mentioned the use of the Patricia trie index in your database. Can you elaborate on the advantage it provides?
Manjul Maharishi: Support for the Patricia trie database index is another key eXtremeDB-64 feature (along with in-memory data storage) that enables Carrier ENUM Registry to meet its performance goals. The name of this specialized index derives from “Practical Algorithm To Retrieve Information Coded In Alphanumeric” and “reTRIEval”. Unlike the widely used B-tree index – which can also be used for finding keys with a specified prefix but can require multiple iterations to find the longest prefix match if there are multiple prefix matches in the index – the Patricia trie excels in searching for the longest prefixes of a specified value.
This approach meshes well with the unique nature of Carrier ENUM Registry’s data, and its queries. Phone numbers serve as keys for the searches that are performed when a call is placed. The key is stored on individual numbers, blocks or ranges.
A block consists of phone numbers in a quantity ranging from 1,000 to 10,000. For example, it could be the number 703667 and four additional digits ranging from 0000 to 9999. A range is a subset of a block. The key would be applied to an individual number, for example, when that number was ported from another carrier and does not fall into a large block of numbers serviced by the company using TNS’ application.
In most of the cases, it is not known beforehand if the number being queried has been “ported out” of a block or not, so the application (in the absence of the Patricia trie) would have to make multiple queries – starting with the most specific match and then dropping the least-significant digit one-by-one till a match was found. With the Patricia trie, there is only one iteration within the original query, which is much less taxing in performance terms and greatly simplifies the application logic.
Q10. Are there other aspects of your approach to data management that you’d like to mention?
Manjul Maharishi: The hybrid storage capability of eXtremeDB Fusion, mentioned above, gives us useful flexibility. eXtremeDB Fusion enables the developer to specify in-memory or persistent storage for record types within a database. Storing data on a hard disk drive (HDD) or solid state drive (SSD) has two benefits: it reduces memory demands, thereby helping us stay within servers’ maximum memory capacities (this was our primary reason for using eXtremeDB Fusion), and byte for byte, persistent storage is less expensive than memory.
We first used eXtremeDB’s hybrid storage to manage a large set of meta-data for mobile handsets such as device types, model names, dates of activation, etc. This information is used by a non-call-processing application and is looked up less frequently than Carrier ENUM Registry’s real-time routing data, so we were okay with the higher latency and variance in response that is introduced by disk-based access.
We are now expanding our use of hybrid storage to add some additional information (such as mobile device information and capabilities) to stored phone numbers, in order to enhance the communication between two subscribers – for example, by enabling features such as HD-voice, Rich Communications Services (RCS), etc. These features can result in substantially increasing the database size and memory footprint required, and eXtremeDB Fusion allows us to easily configure which portions of the data set are kept in memory and which ones are kept on persistent storage with a configurable subset cached in memory – thus allowing us to store some of the less heavily used dataset in SSDs or regular HDDs, while still maintaining the high performance required for the bulk of the transactions.
In his role as Vice President of Telecom Software Development at TNS, Manjul Maharishi is responsible for overseeing architecture, design, development and testing for all of the Products and Services offered by TNS’ Telecom Services Division. These include several massively sized Telecom Databases (serving Number Portability, Toll Free, Call Routing and Calling Name services), 3G/4G Roaming Hubs, associated Clearing and Settlement services and Data Analytics.
Prior to joining TNS, Manjul has held senior technical management positions at VeriSign and Lucent Technologies working in similar areas, including building the industry’s first widely deployed Softswitch while at Lucent Technologies.
Follow ODBMS.org on Twitter: @odbmsorg
“Cybersecurity is growing in importance with Obama, Xi and Cameron having announced major efforts to gain better control over the Internet. Dataflow computing enables computation as a bump-in-the-wire without disturbing the flow of packets. Building gateways out of DFEs will significantly support the Cybersecurity agenda for years to come.”–Oskar Mencer
I have interviewed Oskar Mencer, CEO and Founder at Maxeler Technologies. Main topic of the interview is Big Data and the Networking industry, and what Maxeler Technologies is contributing in this market.
Q1. What are data flow computers and Dataflow Engines (DFEs)? What are they typically useful for?
Oskar Mencer: Dataflow computers are highly efficient systems for computational problems with large amounts of structured and unstructured data and significant mission critical computation. DFEs are units within dataflow computers which currently hold up to 96GB of memory and provide in the order of 10K parallel operations. To put that into perspective, for some tasks, a DFE has the equivalent compute capability of a farm of several normal computers, but at the fraction of the price and energy consumption.
Q2. What is data flow analytics? and why is it important?
Oskar Mencer: Dataflow analytics is a software stack on top of Dataflow computers, providing powerful compute constructs on large datasets. Dataflow analytics is a potential answer to the challenges of Big Data and the Internet of Things.
Q3. What is a programmable data plane and how can one create secure storage with it?
Oskar Mencer: Software Defined Networking is all about the programmable control plane. Maxeler’s programmable data plane is the next step in the transformation of the Networking industry.
Q4. What are the main challenges for financial institutions who need to analyze and process massive quantities of information instantly from various sources in order to make better trading decisions?
Oskar Mencer: Today’s financial institutions have a major challenge from new legislation and requirements imposed by governments. Technology can solve some of the issues, but not all of them. On the trading side, whoever manages to process more data and derive more predictive capability from it, has a better position in the marketplace. Trading is becoming more complex and more regulated, and Maxeler’s Technology, in particular as it applies to exchanges, is starting to make a significant difference in the field, helping to push the state-of-the-art while simultaneously making finance safer.
Q5. Juniper Networks announced QFX5100-AA, a new application acceleration switch, and QFX-PFA, a new packet flow accelerator module. How do they plan to use Maxeler Technologies’ dataflow computing?
Oskar Mencer: The Application Acceleration module is based on Maxeler DFEs and programmble with Maxeler dataflow programming tools and infrastructure. The variety of networking applications this enables is tremendous, as is evident from our App gallery, which includes Apps for the Juniper switch .
Q6. What are the advantages of using a Spark/Hadoop appliance using a Juniper switch with programmable data plane?
Oskar Mencer: With a Juniper switch with a programmable dataplane, one could cache answers, compute in the network, optimize and merge maps, and generally make Spark/Hadoop deployment more scalable and more efficient.
Q7. Do you see a convergence of computer, networking and storage via Dataflow Engines (DFEs)?
Oskar Mencer: Indeed, DFEs provide efficiency at the core of networking, storage as well as compute. Dataflow computing has the potential to unify computation, the movement of data and the storage of data into a single system to solve the largest Big data analytics challenges that lie ahead.
Q8. Maxeler has been mentioned in a list of major HPC applications that had an impact on Quality of Life and Prosperity. Could you please explain what is special about this HPC application?
Oskar Mencer: Maxeler solutions provide competitive advantage and help in situations with mission critical challanges. In 2011 just after the hight of the credit crisis, Maxeler won the American Finance Technology Award with JP Morgan for applying dataflow computing to credit derivatives risk computations. Dataflow computing is a good solution for challenges where computing matters.
Q9. Big Data for the Common Good. What is your take on this?
Oskar Mencer: Big Data is a means to an end. Common good arises from bringing more predictability and stability into our lives. For example, many marriages have been saved by the availability of Satnav technology in cars, clearly a Big Data challenge. Medicine is an obvious Big Data challenge. Curing a patient is as much a Big Data challenge as fighting crime, and government in general. I see Maxeler’s dataflow computing technology as a key opportunity to address the Big Data challenges of today and tomorrow.
Qx Anything else you wish to add?
Oskar Mencer: Cybersecurity is growing in importance with Obama, Xi and Cameron having announced major efforts to gain better control over the Internet. Dataflow computing enables computation as a bump-in-the-wire without disturbing the flow of packets. Building gateways out of DFEs will significantly support the Cybersecurity agenda for years to come.
Oskar Mencer is CEO and Founder at Maxeler Technologies.
Prior to founding Maxeler, Oskar was Member of Technical Staff at the Computing Sciences Center at Bell Labs in Murray Hill, leading the effort in “Stream Computing”. He joined Bell Labs after receiving a PhD from Stanford University. Besides driving Maximum Performance Computing (MPC) at Maxeler, Oskar was Consulting Professor in Geophysics at Stanford University and he is also affiliated with the Computing Department at Imperial College London, having received two Best Paper Awards, an Imperial College Research Excellence Award in 2007 and a Special Award from Com.sult in 2012 for “revolutionising the world of computers”.
Follow ODBMS.org on Twitter: @odbmsorg
“One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration.”– John Leach
I have interviewed John Leach, CTO & Cofounder Splice Machine. Main topics of the interview are Hadoop, Big Data integration and what Splice Machine has to offer in this space. Monte Zweben, CEO of Splice Machine also contributed to the interview.
Q1. What are the Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation?
John Leach, Monte Zweben:
1. Individual record lookups. Most SQL-on-Hadoop engines are designed for full table scans in analytics, but tend to be too slow for the individual record lookups and ranges scan used by operational applications.
2. Dirty Data. Dirty data is a problem for any system, but it is compounded in Big Data, often resulting in bad reports and delays to reload an entire data set.
3. Sharding. It can be difficult to know what key to distribute data and the right shard size. This results in slow queries, especially for large joins or aggregations.
4. Hotspotting. This happens when data becomes too concentrated in a few nodes, especially for time series data. The impact is slow queries and poor parallelization.
5. SQL coverage. Limited SQL dialects will make it so you can’t run queries to meet business needs. You’ll want to make sure you do your homework. Compile the list of toughest queries and test.
6. Concurrency. Low concurrency can result in the inability to power real-time apps, handle many users, support many input sources, and deliver reports as updates happen.
7. Columnar. Not all columnar solutions are created equally. Besides columnar storage, there are many other optimizations, such as vectorization and run length encoding that can have a big impact on analytic performance. If your OLAP queries run slower, common with large joins and aggregations, this will result in poor productivity. Queries may take minutes or hours instead of seconds. On the flip-side is using columnar when you need concurrency and real-time.
8. Node Sizing. Do your homework and profile your workload. Choosing the wrong node size (e.g., CPU cores, memory) can negatively impact price/performance and create performance bottlenecks.
9. Brittle ETL on Hadoop. With many SQL-on-Hadoop solutions being unable to provide update or delete capabilities without a full data reload, this can cause a very brittle ETL that will require restarting your ETL pipeline because of errors or data quality issues. The result is a missed ETL window and delayed reports to business users.
10. Cost-Based Optimizer. A cost-based optimizer improves performance by selecting the right join strategy, the right index, and the right ordering. Some SQL-on-Hadoop engines have no cost-based optimizer or relatively immature ones that can result in poor performance and poor productivity, as well as manual tuning by DBAs.
Q2. In your experience, what are the most common problems in Big Data integration?
John Leach, Monte Zweben: Providing users access to data in a fashion they can understand and at the moment they need it, while ensuring quality and security, can be incredibly challenging.
The volume and velocity of data that businesses are churning out, along with the variety of different sources, can pose many issues.
One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration. Businesses may also find their skill sets, workload, and budgets over-stretched by the need to manage terabytes or petabytes of structured and unstructured data in a way that delivers genuine value to business users.
When data is siloed and there is no solution put into place, businesses can’t access the real-time insights they need to make the best decisions for their business. Performance goes down, headaches abound and cost goes way up, all in the effort to manage the data. That’s why a Big Data integration solution is a prerequisite for getting the best performance and the most real-time insights, at the lowest cost.
Q3. What are the capabilities of Hadoop beyond data storage?
John Leach, Monte Zweben: Hadoop has a very broad range of capabilities and tools:
– Oozie for workflow
– Pig for scripting
– Mahout or SparkML for machine learning
– Kafka and Storm for streaming
– Flume and Sqoop for integration
– Hive, Impala, Spark, and Drill for SQL analytic querying
– HBase for NoSQL
– Splice Machine for operational, transactional RDBMS
Q4. What programming skills are required to handle application development around Big Data platforms like Hadoop?
John Leach, Monte Zweben: To handle application development on Hadoop, individuals have choices to go raw Hadoop or SQL-on-Hadoop. When going the SQL route, very little new skills are required and developers can open connections to an RDBMS on Hadoop just like they used to do on Oracle, DB2, SQLServer, or Teradata. Raw HAdoop application developers should know their way around the core components of the Hadoop stack–such as HDFS, MapReduce, Kafaka, Storm, Oozie, Hive, Pig, HBase, and YARN. They should also be proficient in Java.
Q5. What are the current challenges for real-time application deployment on Hadoop?
John Leach, Monte Zweben: When we talk about real-time at Splice Machine, we’re focused on applications that require not only real-time responses to queries, but also real-time database updates from a variety of data sources. The former is not all that uncommon on Hadoop; the latter is nearly impossible for most Hadoop-based systems.
Deploying real-time applications on Hadoop is really a function of moving Hadoop beyond its batch processing roots to be able to handle real-time database updates with high concurrency and transactional integrity. We harness HBase along with a lockless snapshot isolation design to provide full ACID transactions across rows and tables.
This technology enables Splice Machine to execute the high concurrency of transactions required by real-time applications.
Q6. What is special about Splice Machine auto-sharding replication and failover technology?
John Leach, Monte Zweben: As part of its automatic auto-sharding, HBase horizontally partitions or splits each table into smaller chunks or shards that are distributed across multiple servers. Using the inherent failover and replication capabilities of HBase and Hadoop, Splice Machine can support applications that demand high availability.
HBase co-processors are used to embed Splice Machine in each distributed HBase region (i.e., data shard). This enables Splice Machine to achieve massive parallelization by pushing the computation down to each distributed data shard without any overhead of MapReduce.
Q7. How difficult is it for customers to migrate from legacy databases to Splice Machine?
John Leach, Monte Zweben: Splice Machine offers a variety of services to help businesses efficiently deploy the Splice Machine database and derive maximum value from their investment. These services include both implementation consulting and educational offerings delivered by our expert team.
Splice Machine has designed a Safe Journey program to significantly ease the effort and risk for companies migrating to a Splice Machine database. The Safe Journey program includes a proven methodology that helps choose the right workloads to migrate, implements risk-mitigation best practices, and includes commercial tools that automate most of the PL/SQL conversion process.
This is not to suggest that all legacy databases will convert to a Hadoop RDBMS.
The best candidates will typically have over 1TB of data, which often leads to cost and scaling issues in legacy databases.
Q8. You have recently announced partnership with Talend, mrc (michaels, ross & cole ltd.) and RedPoint Global. Why Talend, mrc, and RedPoint Global? What is the strategic meaning of these partnerships for Splice Machine?
John Leach, Monte Zweben: Our uptick in recent partnerships demonstrates the tremendous progress our team has made over the past year. We have been working relentlessly to develop the Splice Machine Hadoop RDBMS into a fully enterprise-ready database that can replace legacy database systems.
The demand for programming talent to handle application development is growing faster than the supply of skilled talent, especially around newer platforms like Hadoop. We partnered with mrc to give businesses a solution that can speed real-time application deployment on Hadoop with the staff and tools they currently have, while also offering future-proof applications over a database that scales to meet increasing data demands.
We partnered with Talend to bring our customers the benefit of two different approaches for managing data integration affordable and at scale. Talend’s rich capabilities including drag and drop user interface, and adaptable platform allow for increased productivity and streamlined testing for faster deployment of web, mobile, OLTP or Internet of Things applications.
And finally, we integrated and certified our Hadoop RDBMS on RedPoint’s Convergent Marketing Platform™ to create a new breed of solution for marketers. With cost-efficient database scale-out and real-time cross-channel execution, the solution enables enterprises to future-proof their marketing technology investment through affordable access to all their data (social, mobile, click streams, website behaviors, etc.) across a proliferating and ever-changing list of channels. Furthermore, it complements any existing Hadoop deployment, including those on the Cloudera, MapR and Hortonworks distributions.
Q9. How is Splice Machine working with Hadoop distribution partners –such as MapR, Hortonworks and Cloudera?
John Leach, Monte Zweben: Since Splice Machine does not modify HBase, it can be used with any standard Hadoop distribution that includes HBase, including Cloudera, MapR and Hortonworks. Splice Machine enables enterprises using these three companies to tap into real-time updates with transactional integrity, an important feature for companies looking to become real-time, data-driven businesses.
In 2013, Splice Machine partnered with MapR to enable companies to use the MapR distribution for Hadoop to build their real time, SQL-on-Hadoop applications. In 2014, we joined the Cloudera Connect Partner Program, after certifying on CDH 5. We are working closely with Cloudera to maximize the potential of its full suite of Hadoop-powered software and our unique approach to real-time Hadoop.
That same year, we joined Hortonworks Technology Partner program. This enabled our users to harness innovations in management, provisioning and security for HDP deployments. For HDP users, Splice Machine enables them to build applications that use ANSI-standard SQL and support real-time updates with transactional integrity, allowing Hadoop to be used in both OLTP and OLAP applications.
Earlier this year, we were excited to achieve Hortonworks® Data Platform (HDP™) Certification. With the HDP certification, our customers can leverage the pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform, the industry’s only 100-percent open source Hadoop distribution, to simplify and accelerate their Splice Machine and Hadoop deployments.
Q10 What are the challenges of running online transaction processing on Hadoop?
John Leach, Monte Zweben: With its heritage as a batch processing system, Hadoop does not provide the transaction support required by online transaction processing. Transaction support can be tricky enough to implement for shared-disk RDBMSs such as Oracle, but it becomes far more difficult to implement in distributed environments such as Hadoop. A distributed transactional model requires high-levels of coordination across a cluster with too much overhead, while simultaneously providing high performance for a high concurrency of small read and writes, high-speed ingest, and massive bulk loads. We prove this by being able to run the TPC-C benchmark at scale.
Splice Machine met those requirements by using distributed snap isolation, a Multi-Version Concurrency Control model that delivers lockless, and high-concurrency transactional support. Splice Machine extended research from Google’s Percolator project, Yahoo Lab’s OMID project, and the University of Waterloo’s HBaseSI project to develop its own patent-pending, distributed transactions.
John Leach – CTO & Cofounder Splice Machine
With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies.
Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning.
John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach currently is the organizer for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.
Monte Zweben – CEO & Cofounder Splice Machine
A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program.
Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.
In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA.
Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.
Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.
“The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.”–Simon Garland
Q1. Talking about the financial services industry, what types of data and what quantities are common?
Simon Garland: The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.
The data comes in through feed-handlers as streaming data. It is stored in-memory throughout the day and is appended to the on-disk historical database at the day’s end. Algorithmic trading decisions are made on a millisecond basis using this data. The associated risks are evaluated in real-time based on analytics that draw on intraday data that resides in-memory and historical data that resides on disk.
Q2. What are the most difficult data management requirements for high performance financial trading and risk management applications?
Simon Garland: There has been a decade-long arms race on Wall Street to achieve trading speeds that get faster every year. Global financial institutions in particular have spent heavily on high performance software products, as well as IT personnel and infrastructure just to stay competitive. Traders require accuracy, stability and security at the same time that they want to run lightning fast algorithms that draw on terabytes of historical data.
Traditional databases cannot perform at these levels. Column store databases are generally recognized to be orders of magnitude faster than regular RDBMS; and a time-series optimized columnar database is uniquely suited for delivering the performance and flexibility required by Wall Street.
Q3. And why is this important for businesses?
Simon Garland: Orders of magnitude improvements in performance will open up new possibilities for “what-if” style analytics and visualization; speeding up their pace of innovation, their awareness of real-time risks and their responsiveness to their customers.
The Internet of Things in particular is important to businesses who can now capitalize on the digitized time-series data they collect, like from smart meters and smart grids. In fact, I believe that this is only the beginning of the data volumes we will have to be handling in the years to come. We will be able to combine this information with valuable data that businesses have been collecting for decades.
Q4. One of the promise of Big Data for many businesses is the ability to effectively use both streaming data and the vast amounts of historical data that will accumulate over the years, as well as the data a business may already have warehoused, but never has been able to use. What are the main challenges and the opportunities here?
Simon Garland: This can seem like a challenge for people trying to put a system together from a streaming database; an in-memory database from a different vendor, and an historical database from yet another vendor. They then pull data from all of these applications into yet another programming environment. This method cannot give performance and long term is fragile and unmaintainable.
The opportunity here is for a database platform that unifies the software stack, like kdb+, that is robust, easily scalable and easily maintainable.
Q5. How difficult is to combine and process streaming, in-memory and historical data in real time analytics at scale?
Simon Garland: This is an important question. These functionalities can’t be added afterwards. Kdb+ was designed for streaming data, in-memory data and historical data from the beginning. It was also designed with multi-core and multi-process support from the beginning which is essential for processing large amounts of historical data in parallel on current hardware.
We were doing this for decades, even before multi-core machines existed — which is why Wall Street was an early adopter of our technology.
Q6. q programming language vs. SQL: could you please explain the main differences? And also highlight the Pros and cons of each.
Simon Garland: The q programming language is built into the database system kdb+. It is an array programming language that inherently supports the concepts of vectors and column store databases rather than the rows and records that traditional SQL supports.
The main difference is that traditional SQL doesn’t have a concept of order built in, whereas the q programming language does. Unlike traditional SQL, the language q contains a concept of order. This makes complete sense when dealing with time-series data.
Q is intuitive and the syntax is extremely concise, which leads to more productivity, less maintenance and quicker turn-around time.
Q7. Could you give us some examples of successful Big Data real time analytics projects you have been working on?
Simon Garland: Utility applications are using kdb+ for millisecond queries of tables with hundreds of billions of data points captured from millions of smart meters. Analytics on this data can be used for balancing power generation, managing blackouts and for billing and maintenance.
Internet companies with massive amounts of traffic are using kdb+ to analyze Googlebot behavior to learn how to modify pages to improve their ranking. They tell us that traditional databases simply won’t work when they have 100 million pages receiving hundreds of millions of hits per day.
In industries like pharmaceuticals, where decision-making is based on data that can be one day, one week or one month old, our customers and prospects say our column store database makes their legacy data warehouse software obsolete. It is many times faster on the same queries. The time needed for complex analyses on extremely large tables has literally been reduced from hours to seconds.
Q8. Are there any similarities in the way large data sets are used in different vertical markets such as financial service, energy & pharmaceuticals?
Simon Garland: The shared feature is that all of our customers have structured, time-series data. The scale of their data problems are completely different, as are their business use cases. The financial services industry, where kdb+ is an industry standard, demands constant improvements to real-time analytics.
Other industries, like pharma, telecom, oil and gas and utilities, have a different concept of time. They also often are working with smaller data extracts, which they often still consider “Big Data.” When data comes in one day, one week or one month after an event occurred, there is not the same sense of real-time decision making as in finance. Having faster results for complex analytics helps all industries innovate and become more responsive to their customers.
Q9. Anything else you wish to add?
Simon Garland: If we piqued your interest, we have a free, 32-bit version of kdb+ available for download on our web site.
Simon Garland, Chief Strategist, Kx Systems
Simon is responsible for upholding Kx’s high standards for technical excellence and customer responsiveness. He also manages Kx’s participation in the Securities Trading Analysis Center, overseeing all third-party benchmarking.
Prior to joining Kx in 2002, Simon worked at a database search engine company.
Before that he worked at Credit Suisse in risk management. Simon has developed software using kdb+ and q, going back to when the original k and kdb were introduced. Simon received his degree in Mathematics from the University of London and is currently based in Europe.
Follow ODBMS.org on Twittwer: @odbmsorg
“The future of procurement lies in optimising cost and managing risk across the entire supplier base; not just the larger suppliers. Easy access to a complete view of supplier relationships across the enterprise will help those responsible for procurement to make favorable decisions, eliminate waste, increase negotiating leverage and manage risk better. “–Shobhit Chugh.
Data Curation, Big Data and the challenges and the future of Procurement/Supply Chain Management are among the topics of the interview with Shobhit Chugh, Product Marketing Lead at Tamr, Inc.
Q1. In your opinion, what is the future of Procurement/Supply Chain Management?
Shobhit Chugh: Procurement spend is one of the largest spend items for most companies; and supplier risk is one of the items that keeps CEOs of manufacturing companies up at night. Just recently, for example, an issue with a haptic device supplier created a shortage of Apple Watches just after the product’s launch.
At the same time, the world is changing: more data sources are available with increasing variety, and that keeps changing with frequent mergers and acquisitions. The future of procurement lies in optimizing cost and managing risk across the entire supplier base; not just the larger suppliers. Easy access to a complete view of supplier relationships across the enterprise will help those responsible for procurement to make favorable decisions, eliminate waste, increase negotiating leverage and manage risk better.
Q2. What are the current key challenges for Procurement/Supply Chain Management?
Shobhit Chugh: Companies looking for efficiency in their supply chains are limited by the siloed nature of procurement. The domain knowledge needed to properly evaluate suppliers typically resides deep in business units and suppliers are managed at ground level, preventing organizations from taking a global view of suppliers across the enterprise. Those people selecting and managing vendors want to drive terms that favor their company, but don’t have reliable cross-enterprise information on suppliers to make those decisions, and the cost of organizing and analyzing the data has been prohibitive.
Q3. What is the impact of Big Data on the Procurement/Supply Chain?
Shobhit Chugh: A brute force, manual effort to get a single view of suppliers on items such as terms, prices, risk metrics, quality, performance, etc. has traditionally been nearly impossible to do cost effectively. Even if the data exists within the organization, data challenges make it hard to consolidate information into a single view across business units. Rule-based approaches for unifying this data have scale limitations and are difficult to enforce given the distributed nature of procurement. And this does not even include the variety of external data sources that companies can take advantage of, which further increases the potential impact of big data.
Big data changes the situation by providing the ability to evaluate supplier contracts and performance in real time, and puts that intelligence in the hands of people working with suppliers so they can make better decisions. Big data holds significant promise, but only when data unification brings the decentralized data and expertise together to serve the greater good.
Q4. Why does this challenge call for data unification?
Shobhit Chugh: The quality of analysis coming out of procurement optimization is directly related to the volume and quality of data going in. Bringing that data together is no minor feat. In our experience, any individual in an organization can effectively use no more than ten percent of the organization’s data even under very good conditions. Given the distributed nature of procurement, that figure is likely dramatically lower in this situation. Cataloging the hundreds or thousands of internal and external data sources related to procurement provides the foundation for improved decision making.
Similarly, the ability to compare data is directly correlated to the ability to match data points in the same category or related to the same supplier. This is where top-down approaches often get bogged down. Part names, supplier names, site IDs and other data attributes need to be normalized and organized. The efficiency of big data is severely limited if like data sets in various formats aren’t brought together for meaningful comparison.
Q5. How is data unification related to Procurement/Supply Chain Management?
Shobhit Chugh: There are several ways for highly trained data scientists to combine a handful of sources for analysis. Procurement optimization across all suppliers is a markedly different challenge. Procurement data for a company could reside in dozens to thousands of places with very little similarity with regard to how the data is organized. Not only is this data hard for a centralized resource to find and collect, it is hard for non-experts to properly organize and prepare for analysis.
This data must be curated so that analysis returns meaningful results.
One thing I want to emphasize is that data unification is an ongoing activity rather than a one-time integration task. Companies that recognize this continue to extract the maximum value out of data, and are also able to adapt to opportunities to bring in more internal and external data sources when the opportunity presents itself.
Q6. Can you put that in the context of a real world example?
Shobhit Chugh: A highly diversified manufacturer we work with wanted a single view of suppliers across numerous information silos spanning multiple business units. A supplier master list would ultimately contain over a hundred thousand supplier records from many ERP systems. Just one business unit was maintaining over a dozen ERP systems, with new ERP systems regularly coming on line or being added through acquisitions. The list of suppliers also changed rapidly, making functions like deduplication nearly impossible to maintain. Additionally, the company wanted to integrate external data to enrich internal data with information on each supplier’s fiscal strength and structure.
A “bottom-up,” probabilistic approach to data integration proved to be more scalable than a traditional “top-down” manual approach, due to the sheer volume and variety of data sources. Specifically, the company leveraged our machine learning algorithms to continuously re-evaluate and remove potential duplicate entries, driving automation supported by expert guidance into a previously manual process performed by non-experts. The initial result was elimination of 33 percent of suppliers from the master list, just through deduplication.
The company then looked across multiple businesses’ governance systems for suppliers that were related through a corporate structure and identified a significant overlap. Using the same core master list, operational teams were able to treat supplier subsidiaries as different entities for payment purposes, while analytics teams got a global view of a supplier to ensure consistent payment terms. From hundreds of single-use sources, the company created a single view of suppliers with multiple important uses.
Q7. When you talk about data curation, who is doing the curation and for whom? Is it centralized?
Shobhit Chugh: Everyone responsible for a supplier relationship, and the corresponding data, has an interest in the completeness of the data pool, and an interest in the most complete analysis possible. They don’t have an interest in committing the time required to unify the data manually. Our approach is to use ever-improving machine learning to handle the bulk of data matching and rely on subject matter experts only when needed. Further, the system learns which experts to ask each time help is needed, depending on the situation. Once the data is unified, it is available for use by all, including data scientists and corporate leaders far removed from the front lines.
Q8. Do all data-enabled organizations need to hire the best data scientists they can find?
Shobhit Chugh: Yes, data-driven companies should create data-driven innovation, and non-obvious insights often take good data scientists who are tasked with looking beyond the next supplier for ways data can impact other areas of the business. Here, too the decentralized model of data unification has dramatic benefits.
The current scarcity of qualified data scientists will only deepen as the growth in demand is expected to far outpace the rate of qualified professionals entering the field. Everyone is looking to hire the best and brightest data scientists to get insights from their data, but relentless hiring is the wrong way to solve the problem. Data scientists spend 80 percent of their time finding and preparing data, and only twenty percent actually finding answers to critical business questions. Therefore, the better path to scaling data scientists is enabling the ones you have to spend more time on analysis rather than data preparation.
Q9. What is the ROI a company could expect from using data unification for procurement?
Shobhit Chugh: Procurement is an exciting area for data unification precisely because once data is unified, value can be derived using existing best practices, now with a much larger percentage of the supplier base.
Value includes better payment terms, cost savings, higher raw material and part quality and lower supplier risk.
Seventy-five to 80 percent of the value of procurement optimization strategies will come from smaller suppliers and contracts, and data unification unlocks this value.
Q10. What do you predict will be the top five challenges for procurement to tackle in the next two years?
Shobhit Chugh: Using data unification and powerful analysis tools, companies will begin to see immediate value from:
• Achieving “most favored” status from suppliers and eliminating poorly structured contracts where suppliers have multiple customers in your organization
• Build holistic relationships with supplier parent organizations based on the full scope of their subsidiaries’ commitments
• Eliminate rules-based approaches to supplier sourcing and other top-down strategies in favor of data-driven, bottom-up strategies that make use of expertise and data spread throughout the organization
• Embrace the variety of pressure points in procurement – price, delivery, quality, minimums, payment terms, risk, etc. – as ways to customize vendor relationships to suit each need rather than a fog that obscures the value of each contract
• Identify the internal procurement “rock stars” and winning strategies that drive the most value for your organization and replicate those ideas enterprise-wide
Qx. Anything else you wish to add?
Shobhit Chugh: The final component we haven’t discussed is the timing associated with these gains.
We’ve seen procurement optimization projects performed in days or weeks that unleash the vast untapped majority of data locked in previously unknown sources. Not long ago, similar projects focused on just the top suppliers took months and quarters. Addressing the full spectrum of suppliers in this way was not feasible. The combination of data unification and big data is perfectly suited to bringing value quickly and sustaining that value by staying on top of the continual tide of new data.
Shobhit Chugh leads product marketing for Tamr, which empowers organizations to leverage all of their data for analytics by automating the cataloging, connection and curation of “hard-to-reach” data with human-guided machine learning. He has spent his career in tech startups including High Start Group, Lattice Engines,Adaptly and Manhattan Associates. He has also worked as a consultant at McKinsey & Company’s Boston and New York offices, where he advised high tech and financial services clients on technology and sales and marketing strategy.
Shobhit holds an MBA from Kellogg School of Management, a Master’s of Engineering Management in Design from McCormick School of Engineering at Northwestern University, and a Bachelor of Technology in Computer Science from Indian Institute of Technology, Delhi.
–Procurement: Fueling optimization through a simplified, unified view, White Paper Tamr (Link to Download , Registration required)
– Data Curation at Scale: The Data Tamer System (LINK to .PDF)
ODBMS.org Experts Notes
Selected contributions from ODBMS.org experts panel:
– Big data, big trouble.
– Data Acceleration Architecture/ Agile Analytics.
– Critical Success Factors for Analytical Models.
– Some Recent Research Insights Operations Research as a Data Science Problem.
–Data Wisdom for Data Science.
Follow ODBMS.org on Twitter: @odbmsorg
“What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging” — Charu Aggarwal.
On Data Mining, Data Science and Big Data, I have interviewed Charu Aggarwal, Research Scientist at the IBM T. J. Watson Research Center, an expert in this area.
Q1. You recently edited two books: Data Classification: Algorithms and Applications and Data Clustering: Algorithms and Applications.
What are the main lessons learned in data classification and data clustering that you can share with us?
Charu Aggarwal: The most important lesson, which is perhaps true for all of data mining applications, is that feature extraction, selection and representation are extremely important. It is all too often that we ignore these important aspects of the data mining process.
Q2. How Data Classification and Data Clustering relate to each other?
Charu Aggarwal: Data classification is the supervised version of data clustering. Data clustering is about dividing the data into groups of similar points. In data classification, examples of groups of points are made available to you. Then, for a given test instance, you are supposed to predict which group this point might belong to.
In the latter case, the groups often have a semantic interpretation. For example, the groups might correspond to fraud/not fraud labels in a credit-card application. In many cases, it is natural for the groups in classification to be clustered as well. However, this is not always the case.
Some methods such as semi-supervised clustering/classification leverage the natural connections between these problems to provide better quality results.
Q3. Can data classification and data clustering be useful also for large data sets and data streams? If yes, how?
Charu Aggarwal: Data clustering is definately useful for large data sets, because clusters can be viewed as summaries of the data. In fact, a particular form of fine-grained clustering, referred to as micro-clustering, is commonly used for summarizing high-volume streaming data in real time. These summaries are then used for many different applications, such as first-story detection, novelty detection, prediction, and so on.
In this sense, clustering plays an intermediate role in enabling other applications for large data sets.
Classification can also be used to generate different types of summary information, although it is a little less common. The reason is that classification is often used as the end-user application, rather than as an intermediate application
like clustering. Therefore, big-data serves as a challenge and as an opportunity for classification.
It serves as a challenge because of obvious computational reasons. It serves as an opportunity because you can build more complex and accurate models with larger data sets without creating a situation, where the model inadvertently overfits to the random noise in the data.
Q4. How do you typically extract “information” from Big Data?
Charu Aggarwal: This is a highly application-specific question, and it really depends on what you are looking for. For example, for the same stream of health-care data, you might be looking for different types of information, depending on whether you are trying to detect fraud, or whether you are trying to discover clinical anomalies. At the end of the day, the role of the domain expert can never be discounted.
However, the common theme in all these cases is to create a more compressed, concise, and clean representation into one of the data types we all recognize and know how to process. Of course, this step is required in all data mining applications, and not just big data applications. What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging.
For example, if you look at Google’s original MapReduce framework, it was motivated by a need to efficiently perform operations that are almost trivial for smaller data sets, but suddenly become very expensive in the big-data setting.
Q5. What are the typical problems and scenarios when you cluster multimedia, text, biological, categorical, network, streams, and uncertain data?
Charu Aggarwal: The heterogeneity of the data types causes significant challenges.
One problem is that the different data types may often be mixed, as a result of which the existing methods can sometimes not be used directly. Some common scenarios in which such data types arise are photo/music/video-sharing (multimedia), healthcare (time-series streams and biological), and social networks. Among these different data types, the probabilistic (uncertain) data types does not seem to have graduated from academia into industry very well. Of course, it is a new area and there is a lot of active research going on. The picture will become clearer in a few years.
Q6. How effective are today ́s clustering algorithms?
Charu Aggarwal: Clustering problems have become increasingly effective in recent years because of advances in high-dimensional methods. In the past, when the data was very high-dimensional most existing methods work poorly because of locally irrelevant attributes and concentration effects. These are collectively referred to as the curse of dimensionality. Techniques such as subspace and projected clustering have been introduced to discover clusters in lower dimensional views of the data. One nice aspect of this approach is that some variations of it are highly interpretable.
Q7. What is in common between pattern recognition, database analytics, data mining, and machine learning?
Charu Aggarwal: They really do the same thing, which is that of analyzing and gleaning insights from data. It is just that the styles and emphases are different in various communities. Database folks are more concerned
about scalability. Pattern recognition and machine learning folks are somewhat more theoretical. The statistical folks tend to use their statistical models. The data mining community is the most recent one, and it was formed to create a common meeting ground for these diverse communities.
The first KDD conference was held in 1995, and we have come a long way since then towards integration. I believe that the KDD conference has played a very major role in the amalgamation of these communities. Today, it is actually possible for the folks from database and machine learning communities to be aware of each other’s work. This was not quite true 20 years ago.
Q8. What are the most “precise” methods in data classification?
Charu Aggarwal: I am sure that you will find experts who are willing to swear by a particular model. However, each model comes with a different set of advantages over different data sets. Furthermore, some models, such as univariate decision trees and rule-based methods, have the advantage of being interpretable even when they are outperformed by other methods. After all, analysts love to know about the “why” aside from the “what.”
While I cannot say which models are the most accurate (highly data specific), I can certainly point to the most “popular” ones today from a research point of view. I would say that SVMs, and neural networks (deep learning) are the most popular classification methods. However, my personal experience has been mixed.
While I have found SVMs to work quite well across a wide variety of settings, neural networks are generally less robust. They can easily over fit to noise or show unstable performance over small ranges of parameters. I am watching the debate over deep learning with some interest to see how it plays out.
Q9. When to use Mahout for classification? and What is the advantage of using Mahout for classification?
Charu Aggarwal: Apache Mahout is a scalable machine learning environment for data mining applications. One distinguishing feature of Apache Mahout is that it builds on top of distributed infrastructures like MapReduce, and enables easy building of machine learning applications. It includes libraries of various operations and applications.
Therefore, it reduces the effort of the end user beyond the basic MapReduce framework. It should be used in cases, where the data is large enough to require the use of such distributed infrastructures.
Q10. What are your favourite success stories in Data Classifications and/or Data Clustering?
Charu Aggarwal: One of my favorite success stores is in the field of high dimensional data, where I explored the effect of locally irrelevant dimensions and concentration effects on various data mining algorithms.
I designed a suite of algorithms for such high-dimensional tasks as clustering, similarity search, and outlier detection.
The algorithms continue to be relevant even today, and we have even generalized some of these results to big-data (streaming) scenarios and other application domains, such as the graph and text domains.
Qx Anything else you wish to add?
Charu Aggarwal: Data mining and data sciences are at exciting cross-roads today. I have been working in this field since 1995, and I have never seen as much excitement about data science in my first 15 years, as I have seen
in the last 5. This is truly quite amazing!
Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in Yorktown Heights, New York.
He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996.
His research interest during his Ph.D. years was in combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James B. Orlin.
He has since worked in the field of performance analysis, databases, and data mining. He has published over 200 papers in refereed conferences and journals, and has applied for or been granted over 80 patents. He is author or editor of nine books.
Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research.
He has served on the program committees of most major database/data mining conferences, and served as program vice-chairs of the SIAM Conference on Data Mining, 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an associate editor of the ACM TKDD Journal, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal.
He is a fellow of the IEEE for “contributions to knowledge discovery and data mining techniques”, and a life-member of the ACM.
– MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
Appeared in:OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. Download: PDF Version
Follow ODBMS.org on Twitter: @odbmsorg
“The IoT will have many, many trillions of connections, particularly considering it’s not just the devices that are connected, but people, organizations, applications, and the underlying network” –Emil Eifrem.
I have interviewed Emil Eifrem, CEO of Neo Technology. Among the topics we discussed: Graph Databases, the new release of Neo4j, and how graphs relate to the Internet of Things.
Q1. Michael Blaha said in an interview: “The key is the distinction between being occurrence-oriented and schema-oriented. For traditional business applications, the schema is known in advance, so there is no need to use a graph database which has weaker enforcement of integrity. If instead, you’re dealing with at best a generic model to which it conforms, then a schema-oriented approach does not provide much. Instead a graph-oriented approach is more natural and easier to develop against.”
What is your take on this?
Emil Eifrem: While graphs do excel where requirements and/or data have an element of uncertainty or unpredictability, many of the gains that companies experience from using graph databases don’t require the schema to be dynamic. What make graph databases suitable is problems where relationships between the data, and not just the data, matter.
That said, I agree that a graph-oriented approach is incredibly natural and easier to develop against. We see this again and again, with development cycle times reduced by 90% in some cases. Performance is also a common driver.
Q2. You recently released Neo4j 2.2. What are the main enhancements to the internal architecture you have done in Neo4j 2.2 and why?
Emil Eifrem: Neo4j 2.2 makes huge strides in performance and scalability. Performance of Cypher queries is up to 100 times faster than before, thanks to a Cost-Based Optimizer, that includes Visual Query Plans as a tuning aid.
Read scaling for highly concurrent workloads can be as much as 10 times higher with the new In-Memory Page Cache, which helps users take better advantage of modern hardware. Write scaling is also significantly higher for highly concurrent transactional workloads. Our engineering team found some clever ways of increasing throughput, by buffering writes to a single transaction log, rather than blocking transactions one at a time where each transaction committed to two transaction logs (graph and index). The last internal architecture change was to integrate a bulk loader into the product. It’s blindingly fast. We use Neo4j internally and a load that took many hours transactionally runs in four minutes with the bulk loader. It operates at throughputs of a million records per second, even for extremely large graphs.
Besides all of the internal improvements, this release also includes a lot of the top-requested developer features from the community in the developer tooling, such as built-in learning and visualization improvements.
Q3. With Neo4j 2.2 you introduce a new page cache. Could you please explain what is new with this page cache?
Emil Eifrem: Neo4j has two levels of caching. In earlier versions, Neo4j delegated the lower level cache to the operating system, by memory mapping files. OS memory mapping is optimized for a wide range of workloads.
As users have continued to push into bigger & bigger workloads, with more and more data, we decided it was time to build a specialized cache, built specially for Neo4j workloads. The page cache uses an LRU-K algorithm, and is auto-configured and statistically optimized, to deliver vastly improved scalability in highly concurrent workloads. The result is much better read scaling in multi-core environments that maintains the ultra-fast performance that’s been the hallmark of Neo4j.
Q4. How is this new page cache helping overcoming some of the limitations imposed by current IO systems? Do you have any performance measurements to share with us?
Emil Eifrem: The benefits kick in progressively as you add cores and threads. In the labs we’ve seen up to 10 times higher read throughput compared to previous versions of Neo4j, in large simulations. We also have some very positive reports from the field indicating similar gains.
Q5. What enhancements did you introduce in Neo4j 2.2 to improve both transactional and batch write performance?
Emil Eifrem: Write throughput has gone up because of two improvements. One is the fast-write buffering architecture. This lets multiple transactions flush to disk at the same time, in a way that improves throughput without sacrificing latency. Secondly, there is a change to the structure of the transaction logs. Prior to 2.2, writes used to be committed one at a time with two-phase commit for both the graph and its index. With the unified transaction log, multiple writes can be committed together, using a more efficient approach than before, for ensuring ACIDity between the graph and indexes.
For bulk initial loading, there’s something entirely different, a utility called “neo4j-import” that’s designed to load data at extremely high rates. We’ve seen complex graphs with tens of billions of nodes and relationships loading at rates of 1M records per second.
Q6. You introduced a cost-based query planner, Cypher, which uses statistics about data sets. How does it work? What statistics do you use about data sets?
Emil Eifrem: In 2.2 we introduced both a cost-based optimizer and a visual query planner for Cypher.
The cost-based optimizer gathers statistics such as the total number of nodes by label and calculates the most efficient query path based not just on information about the question being asked, but information about the data patterns in the graph. While some Cypher read queries perform just as fast as they did before, others can be 100 times faster.
The visual query planner provides insight into how the Neo4j optimizer will execute a query, helping users write better and faster queries because Cypher is more transparent.
Q7. Gartner recently said that “Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
Graph analysis does not necessarily imply the need a dedicated graph database. Do you have any comment?
Emil Eifrem: Gartner is making a business statement, not a technology statement. Graph analysis refers to a business activity. The best tool we know for carrying out relevant and valuable graph analysis problems is graph databases.
The real value of using a dedicated graph database is in the power to use data relationships easily. Because data and its relationships are stored and processed as they naturally occur in a graph database, elements such as index-free adjacency lead to ultra-accurate and speedy responses to even the most complex queries.
Businesses that build operational applications on the right graph database experience measurable benefits: better performance overall, more competitive applications that incorporate previously impossible-to-include real-time features, easier development cycles that lead to faster time-to-market, and higher revenues thanks to speedier innovation and sharper fraud detection.
Q8. How do you position Neo4j with respect to RDBMS which handles XML and RDF data, and to NoSQL databases which handle graph-based data?
Emil Eifrem: While it is possible to use Neo4j to model an RDF-style graph, our observation is that most people who have tried doing this have found RDF much more difficult to learn and use than the property graph model and associated query methods. This is unsurprising given that RDF is a web standard created by an organization chartered with world wide web standards (the W3C), which has a very different set of requirements than organizations do for their enterprise databases. We saw the need to invent a model suited for persistent data inside of an enterprise, for use as a database data model.
As for XML, that again is a great data transport and exchange mechanism, and is conceptually similar to what’s done in document databases. But it’s not really a suitable model for database storage: if what you care about is relating things across your network. XML databases experienced some hype early on but never caught on.
While we’re talking about document databases … there’s another point here worth drilling into, which is not the data model, but the consistency model. If you’re dealing with isolated documents, then it’s okay for the scope of the transaction to be limited to one object. This means eventual consistency is okay. With graphs, because things relate to one another, if you don’t ensure that related things get written to the database on an “all or nothing” basis, then you can very quickly corrupt your graph. This is why BASE is sufficient for other forms of NoSQL, but not for graphs.
Q9. Could you please give use some examples on how graph databases could help supporting the Internet of Things (IoT)?
Emil Eifrem: We love Neo4j for the Internet of Things, but we’d like to see it renamed to Internet of Connected Things! After all, the value is in the connections between all of the things, that is, the connections and interactions between the devices.
Two points are worth remembering:
- Devices in isolation bring little value to the IoT; rather, it’s the connections between devices that truly bring forth the latent possibilities.
- We’re not just speaking about tracking billions of connections; the IoT will have many, many trillions of connections, particularly considering it’s not just the devices that are connected, but people, organizations, applications, and the underlying network.
Understanding and managing these connections will be at least as important for businesses as understanding and managing the devices themselves. Imagination is key to unlocking the value of connected things. For example, in a telecommunications or aviation network, the questions, “What cell tower is experiencing problems?” and “Which plane will arrive late?” can be answered much more accurately by understanding how the individual components are connected and impact one another. Understanding connections is also key to understanding dependencies and uncovering cascading impacts.
Q10. What are your top 3 favourite case studies for Neo4j?
Emil Eifrem: My top three use cases are:
- Real-time Recommendations – Personalize product, content and service offers by leveraging data relationships. (dynamic pricing, financial services products, online retail, routing & networks)
- Fraud Detection – Improve existing fraud detection resulting methods by uncovering hidden relationships to discover fraud rings and indirections. (financial services, health care, government, gaming)
- Master Data Management – Improve business outcome through storage and retrieval of complex and ‘hierarchical’ master data. Top MDM data sets across our customer base include: customer (360 degree view), organizational hierarchy, employee (HR), product / product line management, metadata management / data governance, and CMDB, as well as digital assets. (financial services, telecommunications, insurance, agribusiness)
Q11. Anything else you wish to add?
Emil Eifrem: Yes, one thing. As much as we’re a product company, we are very passionate educators and evangelists. Our mission is to help the world make sense of data. We discovered that graphs are an amazing way of doing that, and we’re working hard to share that with the world.
For anyone interested in learning more, our web site offers a lot of great learning resources: talks, examples, free training… we’ve even worked with O’Reilly to offer up their Graph Databases e-book. By the time this article is published, the second edition should be up on http://graphdatabases.com. Any of your readers who are interested, are welcome to come and learn, and become part of the amazing & rapidly growing worldwide graph database community.
It’s been great to speak with you!
Emil is the founder of the Neo4j open source graph database project, the most widely deployed graph database in the world. As the CEO of Neo4j’s commercial sponsor Neo Technology Emil spreads the word about the powers of graphs everywhere. Emil is a co-author of the O’Reilly Media book “Graph Databases” and presents regularly at conferences around the world such as JAOO, JavaOne, QCon, and OSCON.
Follow ODBMS.org on Twitter: @odbmsorg
“Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.”–Krishna Gade.
I have interviewed Krishna Gade, Engineering Manager on the Data team at Pinterest.
Q1. What are the main challenges you are currently facing when dealing with data at Pinterest?
Krishna Gade: Pinterest is a data product and a data-driven company. Most of our Pinner-facing features like recommendations, search and Related Pins are created by processing large amounts of data every day. Added to this, we use data to derive insights and make decisions on products and features to build and ship. As Pinterest usage grows, the number of Pinners, Pins and the related metadata are growing rapidly. Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.
On the product side, we’re curating a unique dataset we call the ‘interest graph’ which captures the relationships between Pinners, Pins, boards (collections of Pins) and topic categories. As Pins are visual bookmarks of web pages saved by our Pinners, we can have the same web page Pinned many different times. One of the problems we try to solve is to collate all the Pins that belong to the same web page and aggregate all the metadata associated with them.
Visual discovery is an important feature in our product. When you click on a Pin we need to show you visually related Pins. In order to do this we extract features from the Pin image and apply sophisticated deep learning techniques to suggest Pins related to the original. There is a need to build scalable infrastructure and algorithms to mine and extract value from this data and apply to our features like search, recommendations etc.
Q2. You wrote in one of your blog posts that “data-driven decision making is in your company DNA”. Could please elaborate and explain what do you mean with that?
Krishna Gade: It starts from the top. Our senior leadership is constantly looking for insights from data to make critical decisions. Every day, we look at the various product metrics computed by our daily pipelines to measure how the numerous product features are doing. Every change to our product is first tested with a small fraction of Pinners as an A/B experiment, and at any given time we’re running hundreds of these A/B experiments. Over time data-driven decision making has become an integral part of our culture.
Q3. Specifically, what do you use Real-time analytics for at Pinterest?
Krishna Gade: We build batch pipelines extensively throughout the company to process billions of Pins and the activity on them. These pipelines allow us to process vast amounts of historic data very efficiently and tune and personalize features like search, recommendations, home feed etc. However these pipelines don’t capture the activity happening currently – new users signing up, millions of repins, clicks and searches. If we only rely on batch pipelines, we won’t know much about a new user, Pin or trend for a day or two. We use real-time analytics to bridge this gap.
Our real-time data pipelines process user activity stream that includes various actions taken by the Pinner (repins, searches, clicks, etc.) as they happen on the site, compute signals for Pinners and Pins in near real-time and make these available back to our applications to customize and personalize our products.
Q4 Could you pls give us an overview of the data platforms you use at Pinterest?
Krishna Gade: We’ve used existing open-source technologies and also built custom data infrastructure to collect, process and store our data. We built a logging agent Singer, deployed on all of our web servers that’s constantly pumping log data into Kafka, which we use as a log transport system. After the logs reach Kafka, they’re copied into Amazon S3 by our custom log persistence service called Secor. We built Secor to ensure 0-data loss and overcome the weak eventual consistency model of S3.
After this point, our self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing. All our large scale batch pipelines run on Hadoop, which is the core data infrastructure we depend on for improving and observing our product. Our engineers use either Hive or Cascading to build the data pipelines, which are managed by Pinball – a flexible workflow management system we built. More recently, we’ve started using Spark to support our machine learning use-cases.
Q5. You have built a real-time data pipeline to ingest data into MemSQL using Spark Streaming. Why?
Krishna Gade: As of today, most of our analytics happens in the batch processing world. All the business metrics we compute are powered by the nightly workflows running on Hadoop. In the future our goal is to be able to consume real-time insights to move quickly and make product and business decisions faster. A key piece of infrastructure missing for us to achieve this goal was a real-time analytics database that can support SQL.
We wanted to experiment with a real-time analytics database like MemSQL to see how it works for our needs. As part of this experiment, we built a demo pipeline to ingest all our repin activity stream into MemSQL and built a visualization to show the repins coming from the various cities in the U.S.
Q6. Could you pls give us some detail how is it implemented?
Krishna Gade: As Pinners interact with the product, Singer agents hosted on our web servers are constantly writing the activity data to Kafka. The data in Kafka is consumed by a Spark streaming job. In this job, each Pin is filtered and then enriched by adding geolocation and Pin category information. The enriched data is then persisted to MemSQL using MemSQL’s spark connector and is made available for query serving. The goal of this prototype was to test if MemSQL could enable our analysts to use familiar SQL to explore the real-time data and derive interesting insights.
Q7. Why did you choose MemSQL and Spark for this? What were the alternatives?
Krishna Gade: I led the Storm engineering team at Twitter, and we were able to scale the technology for hundreds of applications there. During that time I was able to experience both good and bad aspects of Storm.
When I came to Pinterest, I saw that we were beginning to use Storm but mostly for use-cases like computing the success rate and latency stats for the site. More recently we built an event counting service using Storm and HBase for all of our Pin and user activity. In the long run, we think it would be great to consolidate our data infrastructure to a fewer set of technologies. Since we’re already using Spark for machine learning, we thought of exploring its streaming capabilities. This was the main motivation behind using Spark for this project.
As for MemSQL, we were looking for a relational database that can run SQL queries on streaming data that would not only simplify our pipeline code but would give our analysts a familiar interface (SQL) to ask questions on this new data source. Another attractive feature about MemSQL is that it can also be used for the OLTP use case, so we can potentially have the same pipeline enabling both product insights and user-facing features. Apart from MemSQL, we’re also looking at alternatives like VoltDB and Apache Phoenix. Since we already use HBase as a distributed key-value store for a number of use-cases, Apache Phoenix which is nothing but a SQL layer on top of HBase is interesting to us.
Q8. What are the lessons learned so far in using such real-time data pipeline?
Krishna Gade: It’s early days for the Spark + MemSQL real-time data pipeline, so we’re still learning about the pipeline and ingesting more and more data. Our hope is that in the next few weeks we can scale this pipeline to handle hundreds of thousands of events per second and have our analysts query them in real-time using SQL.
Q9. What are your plans and goals for this year?
Krishna Gade: On the platform side, our plan to is to scale real-time analytics in a big way in Pinterest. We want to be able to refresh our internal company metrics, signals into product features at the granularity of seconds instead of hours. We’re also working on scaling our Hadoop infrastructure especially looking into preventing S3 eventual consistency from disrupting the stability of our pipelines. This year should also see more open-sourcing from us. We started the year by open-sourcing Pinball, our workflow manager for Hadoop jobs. We plan to open-source Singer our logging agent sometime soon.
One the product side, one of our big goals is to scale our self-serve ads product and grow our international user-base. We’re focusing especially on markets like Japan and Europe to grow our user-base and get more local content into our index.
Qx. Anything else you wish to add?
Krishna Gade: For those who are interested in more information, we share latest from the engineering team on our Engineering blog. You can follow along with the blog, as well as updates on our Facebook Page. Thanks a lot for the opportunity to talk about Pinterest engineering and some of the data infrastructure challenges.
Krishna Gade is the engineering manager for the data team at Pinterest. His team builds core data infrastructure to enable data driven products and insights for Pinterest. They work on some of the cutting edge big data technologies like Kafka, Hadoop, Spark, Redshift etc. Before Pinterest, Krishna was at Twitter and Microsoft building large scale search and data platforms.
–Singer, Pinterest’s Logging Infrastructure (LINK to SlideShares)
–Introducing Pinterest Secor (LINK to Pinterest engineering blog)
–MemSQL’s spark connector (memsql/memsql-spark-connector GitHub)
Follow ODBMS.org on Twitter: @odbmsorg