“One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration.”– John Leach
I have interviewed John Leach, CTO & Cofounder Splice Machine. Main topics of the interview are Hadoop, Big Data integration and what Splice Machine has to offer in this space. Monte Zweben, CEO of Splice Machine also contributed to the interview.
Q1. What are the Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation?
John Leach, Monte Zweben:
1. Individual record lookups. Most SQL-on-Hadoop engines are designed for full table scans in analytics, but tend to be too slow for the individual record lookups and range scans used by operational applications.
2. Dirty Data. Dirty data is a problem for any system, but it is compounded in Big Data, often resulting in bad reports and delays to reload an entire data set.
3. Sharding. It can be difficult to know which key to distribute data on and what the right shard size is. Poor choices result in slow queries, especially for large joins or aggregations.
4. Hotspotting. This happens when data becomes too concentrated in a few nodes, especially for time series data. The impact is slow queries and poor parallelization.
5. SQL coverage. A limited SQL dialect can prevent you from running the queries your business needs. Do your homework: compile a list of your toughest queries and test them.
6. Concurrency. Low concurrency can result in the inability to power real-time apps, handle many users, support many input sources, and deliver reports as updates happen.
7. Columnar. Not all columnar solutions are created equal. Besides columnar storage, there are many other optimizations, such as vectorization and run-length encoding, that can have a big impact on analytic performance. If your OLAP queries run slowly, which is common with large joins and aggregations, productivity suffers: queries may take minutes or hours instead of seconds. On the flip side, columnar storage is a poor fit when you need concurrency and real-time behavior.
8. Node Sizing. Do your homework and profile your workload. Choosing the wrong node size (e.g., CPU cores, memory) can negatively impact price/performance and create performance bottlenecks.
9. Brittle ETL on Hadoop. With many SQL-on-Hadoop solutions being unable to provide update or delete capabilities without a full data reload, this can cause a very brittle ETL that will require restarting your ETL pipeline because of errors or data quality issues. The result is a missed ETL window and delayed reports to business users.
10. Cost-Based Optimizer. A cost-based optimizer improves performance by selecting the right join strategy, the right index, and the right ordering. Some SQL-on-Hadoop engines have no cost-based optimizer or relatively immature ones that can result in poor performance and poor productivity, as well as manual tuning by DBAs.
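Pitfalls 3 and 4 can be made concrete with a small sketch. This is illustrative Python, not Splice Machine or HBase code. It assumes a hypothetical four-node cluster and compares two made-up placement functions: one that partitions rows by timestamp range, and one that "salts" the key with a hash of the sensor id. The names `node_by_time_range` and `node_by_salted_key` and the sample readings are assumptions for illustration only.

```python
import hashlib
from collections import Counter

NUM_NODES = 4          # hypothetical cluster size
DAY_SECONDS = 86_400


def node_by_time_range(ts: int) -> int:
    """Range partitioning on the timestamp: each node owns a quarter of the day."""
    return (ts % DAY_SECONDS) * NUM_NODES // DAY_SECONDS


def node_by_salted_key(ts: int, sensor_id: str) -> int:
    """Salted key: a hash of the sensor id picks the node; the timestamp only orders rows within it."""
    digest = hashlib.sha256(sensor_id.encode("utf-8")).digest()
    return digest[0] % NUM_NODES


# One minute of made-up readings from 200 sensors, all arriving late in the day.
readings = [(80_000 + second, f"sensor-{n}") for second in range(60) for n in range(200)]

by_range = Counter(node_by_time_range(ts) for ts, _ in readings)
by_salt = Counter(node_by_salted_key(ts, sensor) for ts, sensor in readings)

print("range-partitioned:", dict(by_range))   # every write lands on one node
print("salted:           ", dict(by_salt))    # writes are spread across nodes
```

The trade-off in this sketch is that a salted key spreads writes but forces a time-range query to visit every node, which is one reason the choice of distribution key is hard to get right.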
Q2. In your experience, what are the most common problems in Big Data integration?
John Leach, Monte Zweben: Providing users access to data in a fashion they can understand and at the moment they need it, while ensuring quality and security, can be incredibly challenging.
The volume and velocity of data that businesses are churning out, along with the variety of different sources, can pose many issues.
One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration. Businesses may also find their skill sets, workload, and budgets over-stretched by the need to manage terabytes or petabytes of structured and unstructured data in a way that delivers genuine value to business users.
When data is siloed and there is no solution put into place, businesses can’t access the real-time insights they need to make the best decisions for their business. Performance goes down, headaches abound and cost goes way up, all in the effort to manage the data. That’s why a Big Data integration solution is a prerequisite for getting the best performance and the most real-time insights, at the lowest cost.
Q3. What are the capabilities of Hadoop beyond data storage?
John Leach, Monte Zweben: Hadoop has a very broad range of capabilities and tools:
– Oozie for workflow
– Pig for scripting
– Mahout or SparkML for machine learning
– Kafka and Storm for streaming
– Flume and Sqoop for integration
– Hive, Impala, Spark, and Drill for SQL analytic querying
– HBase for NoSQL
– Splice Machine for operational, transactional RDBMS
Q4. What programming skills are required to handle application development around Big Data platforms like Hadoop?
John Leach, Monte Zweben: To handle application development on Hadoop, individuals can choose between raw Hadoop and SQL-on-Hadoop. When going the SQL route, very few new skills are required, and developers can open connections to an RDBMS on Hadoop just as they used to do on Oracle, DB2, SQL Server, or Teradata. Raw Hadoop application developers should know their way around the core components of the Hadoop stack, such as HDFS, MapReduce, Kafka, Storm, Oozie, Hive, Pig, HBase, and YARN. They should also be proficient in Java.
Q5. What are the current challenges for real-time application deployment on Hadoop?
John Leach, Monte Zweben: When we talk about real-time at Splice Machine, we’re focused on applications that require not only real-time responses to queries, but also real-time database updates from a variety of data sources. The former is not all that uncommon on Hadoop; the latter is nearly impossible for most Hadoop-based systems.
Deploying real-time applications on Hadoop is really a function of moving Hadoop beyond its batch processing roots to be able to handle real-time database updates with high concurrency and transactional integrity. We harness HBase along with a lockless snapshot isolation design to provide full ACID transactions across rows and tables.
This technology enables Splice Machine to execute the high concurrency of transactions required by real-time applications.
Q6. What is special about Splice Machine auto-sharding replication and failover technology?
John Leach, Monte Zweben: As part of its auto-sharding, HBase horizontally partitions or splits each table into smaller chunks or shards that are distributed across multiple servers. Using the inherent failover and replication capabilities of HBase and Hadoop, Splice Machine can support applications that demand high availability.
HBase co-processors are used to embed Splice Machine in each distributed HBase region (i.e., data shard). This enables Splice Machine to achieve massive parallelization by pushing the computation down to each distributed data shard without any overhead of MapReduce.
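The idea of pushing computation down to each data shard can be sketched in a few lines. This is illustrative Python rather than HBase co-processor code: the shards are plain lists, a thread pool stands in for the region servers, and `partial_aggregate` and `combine` are hypothetical names. Each shard computes a small partial result next to its data, and only those partials travel to the coordinator.

```python
from concurrent.futures import ThreadPoolExecutor

# Three hypothetical shards of an orders table: (customer, amount) rows.
shards = [
    [("acme", 10.0), ("bolt", 5.0)],
    [("acme", 2.5), ("core", 7.0)],
    [("bolt", 1.0), ("core", 3.0), ("acme", 4.0)],
]


def partial_aggregate(rows):
    """Runs next to the data: returns (sum, count) per customer for one shard."""
    partial = {}
    for customer, amount in rows:
        total, count = partial.get(customer, (0.0, 0))
        partial[customer] = (total + amount, count + 1)
    return partial


def combine(partials):
    """Runs on the coordinator: merges the small per-shard partials."""
    merged = {}
    for partial in partials:
        for customer, (total, count) in partial.items():
            run_total, run_count = merged.get(customer, (0.0, 0))
            merged[customer] = (run_total + total, run_count + count)
    return merged


# Each shard is aggregated independently and in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_aggregate, shards))

totals = combine(partials)
averages = {customer: total / count for customer, (total, count) in totals.items()}
print(averages)
```

Carrying (sum, count) pairs rather than per-shard averages is what lets the coordinator compute a correct global average from the partial results.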
Q7. How difficult is it for customers to migrate from legacy databases to Splice Machine?
John Leach, Monte Zweben: Splice Machine offers a variety of services to help businesses efficiently deploy the Splice Machine database and derive maximum value from their investment. These services include both implementation consulting and educational offerings delivered by our expert team.
Splice Machine has designed a Safe Journey program to significantly ease the effort and risk for companies migrating to a Splice Machine database. The Safe Journey program includes a proven methodology that helps choose the right workloads to migrate, implements risk-mitigation best practices, and includes commercial tools that automate most of the PL/SQL conversion process.
This is not to suggest that all legacy databases will convert to a Hadoop RDBMS.
The best candidates will typically have over 1TB of data, which often leads to cost and scaling issues in legacy databases.
Q8. You have recently announced partnerships with Talend, mrc (michaels, ross & cole ltd.) and RedPoint Global. Why Talend, mrc, and RedPoint Global? What is the strategic meaning of these partnerships for Splice Machine?
John Leach, Monte Zweben: Our uptick in recent partnerships demonstrates the tremendous progress our team has made over the past year. We have been working relentlessly to develop the Splice Machine Hadoop RDBMS into a fully enterprise-ready database that can replace legacy database systems.
The demand for programming talent to handle application development is growing faster than the supply of skilled talent, especially around newer platforms like Hadoop. We partnered with mrc to give businesses a solution that can speed real-time application deployment on Hadoop with the staff and tools they currently have, while also offering future-proof applications over a database that scales to meet increasing data demands.
We partnered with Talend to bring our customers the benefit of two different approaches for managing data integration affordably and at scale. Talend’s rich capabilities, including a drag-and-drop user interface and an adaptable platform, allow for increased productivity and streamlined testing for faster deployment of web, mobile, OLTP or Internet of Things applications.
And finally, we integrated and certified our Hadoop RDBMS on RedPoint’s Convergent Marketing Platform™ to create a new breed of solution for marketers. With cost-efficient database scale-out and real-time cross-channel execution, the solution enables enterprises to future-proof their marketing technology investment through affordable access to all their data (social, mobile, click streams, website behaviors, etc.) across a proliferating and ever-changing list of channels. Furthermore, it complements any existing Hadoop deployment, including those on the Cloudera, MapR and Hortonworks distributions.
Q9. How is Splice Machine working with Hadoop distribution partners such as MapR, Hortonworks and Cloudera?
John Leach, Monte Zweben: Since Splice Machine does not modify HBase, it can be used with any standard Hadoop distribution that includes HBase, including Cloudera, MapR and Hortonworks. Splice Machine enables enterprises using these three distributions to tap into real-time updates with transactional integrity, an important feature for companies looking to become real-time, data-driven businesses.
In 2013, Splice Machine partnered with MapR to enable companies to use the MapR distribution for Hadoop to build their real-time SQL-on-Hadoop applications. In 2014, we joined the Cloudera Connect Partner Program, after certifying on CDH 5. We are working closely with Cloudera to maximize the potential of its full suite of Hadoop-powered software and our unique approach to real-time Hadoop.
That same year, we joined the Hortonworks Technology Partner program. This enabled our users to harness innovations in management, provisioning and security for HDP deployments. Splice Machine enables HDP users to build applications that use ANSI-standard SQL and support real-time updates with transactional integrity, allowing Hadoop to be used in both OLTP and OLAP applications.
Earlier this year, we were excited to achieve Hortonworks® Data Platform (HDP™) Certification. With the HDP certification, our customers can leverage the pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform, the industry’s only 100-percent open source Hadoop distribution, to simplify and accelerate their Splice Machine and Hadoop deployments.
Q10. What are the challenges of running online transaction processing on Hadoop?
John Leach, Monte Zweben: With its heritage as a batch processing system, Hadoop does not provide the transaction support required by online transaction processing. Transaction support can be tricky enough to implement for shared-disk RDBMSs such as Oracle, but it becomes far more difficult to implement in distributed environments such as Hadoop. A distributed transactional model requires high levels of coordination across a cluster without incurring too much overhead, while simultaneously providing high performance for a high concurrency of small reads and writes, high-speed ingest, and massive bulk loads. We demonstrate that Splice Machine meets these requirements by running the TPC-C benchmark at scale.
Splice Machine met those requirements by using distributed snapshot isolation, a Multi-Version Concurrency Control model that delivers lockless, high-concurrency transactional support. Splice Machine extended research from Google’s Percolator project, Yahoo Labs’ OMID project, and the University of Waterloo’s HBaseSI project to develop its own patent-pending distributed transactions.
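To make the idea of lockless snapshot isolation under Multi-Version Concurrency Control more concrete, here is a toy, single-process sketch in Python. It is not Splice Machine's design, nor that of Percolator, OMID or HBaseSI, and it ignores distribution entirely. The classes `Store` and `Transaction` and the first-committer-wins conflict rule are assumptions chosen for illustration. Readers never take locks: each transaction sees only the versions committed at or before its start timestamp.

```python
import itertools


class WriteConflict(Exception):
    """Raised when another transaction committed the same key after we started."""


class Store:
    def __init__(self):
        self._clock = itertools.count(1)   # stand-in for a timestamp oracle
        self._versions = {}                # key -> list of (commit_ts, value), oldest first

    def begin(self):
        return Transaction(self, next(self._clock))

    def read_as_of(self, key, ts):
        visible = [value for commit_ts, value in self._versions.get(key, []) if commit_ts <= ts]
        return visible[-1] if visible else None

    def latest_commit_ts(self, key):
        versions = self._versions.get(key, [])
        return versions[-1][0] if versions else 0

    def install(self, writes):
        commit_ts = next(self._clock)
        for key, value in writes.items():
            self._versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}                   # buffered until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        return self.store.read_as_of(key, self.start_ts)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First committer wins: abort if any written key changed since our snapshot.
        for key in self.writes:
            if self.store.latest_commit_ts(key) > self.start_ts:
                raise WriteConflict(key)
        return self.store.install(self.writes)


store = Store()
setup = store.begin()
setup.write("balance", 100)
setup.commit()

reader = store.begin()           # snapshot taken here
writer = store.begin()
writer.write("balance", 150)
writer.commit()

print(reader.read("balance"))         # prints 100: the reader keeps its snapshot
print(store.begin().read("balance"))  # prints 150: a new transaction sees the update
```

Because old versions are kept rather than overwritten, the reader is never blocked by the writer, which is the property the answer above calls lockless.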
John Leach – CTO & Cofounder Splice Machine
With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies.
Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning.
John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach currently is the organizer for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.
Monte Zweben – CEO & Cofounder Splice Machine
A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program.
Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.
In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA.
Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.
Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.
“The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.”–Simon Garland
Q1. Talking about the financial services industry, what types of data and what quantities are common?
Simon Garland: The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.
The data comes in through feed-handlers as streaming data. It is stored in-memory throughout the day and is appended to the on-disk historical database at the day’s end. Algorithmic trading decisions are made on a millisecond basis using this data. The associated risks are evaluated in real-time based on analytics that draw on intraday data that resides in-memory and historical data that resides on disk.
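The daily flow described above can be sketched as follows. This is a Python stand-in, not kdb+ code: the class `TickStore`, its method names, the one-CSV-file-per-date layout and the sample trades are all assumptions made for illustration. Incoming ticks are kept in memory during the day, appended to on-disk history at the day's end, and a query reads both.

```python
import csv
import tempfile
from pathlib import Path


class TickStore:
    """Toy stand-in: today's trades in memory, earlier days in one CSV file per date."""

    def __init__(self, root: Path):
        self.root = root
        self.intraday = []  # (date, symbol, price) rows for the current day

    def on_tick(self, date: str, symbol: str, price: float) -> None:
        """Called by a feed handler for each incoming trade."""
        self.intraday.append((date, symbol, price))

    def end_of_day(self, date: str) -> None:
        """Append the day's in-memory rows to the on-disk history and clear memory."""
        with open(self.root / f"{date}.csv", "a", newline="") as f:
            csv.writer(f).writerows(self.intraday)
        self.intraday.clear()

    def prices(self, symbol: str):
        """Yield prices for a symbol from disk (oldest date first), then from memory."""
        for path in sorted(self.root.glob("*.csv")):
            with open(path, newline="") as f:
                for _, sym, price in csv.reader(f):
                    if sym == symbol:
                        yield float(price)
        for _, sym, price in self.intraday:
            if sym == symbol:
                yield price


# Made-up sample data.
store = TickStore(Path(tempfile.mkdtemp()))
store.on_tick("2015-06-01", "IBM", 170.0)
store.on_tick("2015-06-01", "IBM", 171.0)
store.end_of_day("2015-06-01")
store.on_tick("2015-06-02", "IBM", 173.0)

history = list(store.prices("IBM"))
average = sum(history) / len(history)
print(history, average)
```

In this sketch a single query spans historical and intraday data, which is the combination the risk analytics described above rely on.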
Q2. What are the most difficult data management requirements for high performance financial trading and risk management applications?
Simon Garland: There has been a decade-long arms race on Wall Street to achieve trading speeds that get faster every year. Global financial institutions in particular have spent heavily on high performance software products, as well as IT personnel and infrastructure just to stay competitive. Traders require accuracy, stability and security at the same time that they want to run lightning fast algorithms that draw on terabytes of historical data.
Traditional databases cannot perform at these levels. Column store databases are generally recognized to be orders of magnitude faster than regular RDBMS; and a time-series optimized columnar database is uniquely suited for delivering the performance and flexibility required by Wall Street.
Q3. And why is this important for businesses?
Simon Garland: Orders-of-magnitude improvements in performance open up new possibilities for “what-if” style analytics and visualization, speeding up businesses’ pace of innovation, their awareness of real-time risks and their responsiveness to their customers.
The Internet of Things in particular is important to businesses that can now capitalize on the digitized time-series data they collect, for example from smart meters and smart grids. In fact, I believe that this is only the beginning of the data volumes we will have to be handling in the years to come. We will be able to combine this information with valuable data that businesses have been collecting for decades.
Q4. One of the promises of Big Data for many businesses is the ability to effectively use both streaming data and the vast amounts of historical data that will accumulate over the years, as well as the data a business may already have warehoused but has never been able to use. What are the main challenges and the opportunities here?
Simon Garland: This can seem like a challenge for people trying to put a system together from a streaming database from one vendor, an in-memory database from a different vendor, and a historical database from yet another vendor. They then pull data from all of these applications into yet another programming environment. This approach cannot deliver performance and, in the long term, is fragile and unmaintainable.
The opportunity here is for a database platform that unifies the software stack, like kdb+, that is robust, easily scalable and easily maintainable.
Q5. How difficult is it to combine and process streaming, in-memory and historical data in real-time analytics at scale?
Simon Garland: This is an important question. These functionalities can’t be added afterwards. Kdb+ was designed for streaming data, in-memory data and historical data from the beginning. It was also designed with multi-core and multi-process support from the beginning which is essential for processing large amounts of historical data in parallel on current hardware.
We have been doing this for decades, even before multi-core machines existed, which is why Wall Street was an early adopter of our technology.
Q6. The q programming language vs. SQL: could you please explain the main differences, and also highlight the pros and cons of each?
Simon Garland: The q programming language is built into the database system kdb+. It is an array programming language that inherently supports the concepts of vectors and column store databases rather than the rows and records that traditional SQL supports.
The main difference is that traditional SQL doesn’t have a concept of order built in, whereas the q programming language does. This makes complete sense when dealing with time-series data.
Q is intuitive and the syntax is extremely concise, which leads to more productivity, less maintenance and quicker turn-around time.
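To illustrate why a built-in concept of order matters for time-series data, here is a short sketch of two order-dependent operations. It is written in Python, not q, and the sample quotes and trades are made up. `deltas` compares each value with its predecessor, and `asof` finds the most recent quote at or before a given trade time. Both only make sense when the rows have a defined order, which is the kind of operation an order-aware language can express directly.

```python
from bisect import bisect_right

# Made-up sample data, already sorted by time (seconds since the open).
quote_times = [1, 4, 9, 15]
quote_bids = [100.0, 100.5, 100.2, 100.8]

trade_times = [5, 9, 20]
trade_prices = [100.6, 100.3, 100.9]


def deltas(values):
    """Difference between each element and its predecessor; depends on row order."""
    return [b - a for a, b in zip(values, values[1:])]


def asof(times, values, t):
    """Most recent value at or before time t, or None if there is none."""
    i = bisect_right(times, t)
    return values[i - 1] if i else None


# For each trade, look up the bid that was prevailing when it happened.
prevailing_bids = [asof(quote_times, quote_bids, t) for t in trade_times]
print(prevailing_bids)          # prints [100.5, 100.2, 100.8]
print(deltas(trade_prices))     # price change from one trade to the next
```

Expressing the same lookups in a language without an ordering concept typically requires explicit sorting or self-joins on the time column.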
Q7. Could you give us some examples of successful Big Data real time analytics projects you have been working on?
Simon Garland: Utility applications are using kdb+ for millisecond queries of tables with hundreds of billions of data points captured from millions of smart meters. Analytics on this data can be used for balancing power generation, managing blackouts and for billing and maintenance.
Internet companies with massive amounts of traffic are using kdb+ to analyze Googlebot behavior to learn how to modify pages to improve their ranking. They tell us that traditional databases simply won’t work when they have 100 million pages receiving hundreds of millions of hits per day.
In industries like pharmaceuticals, where decision-making is based on data that can be one day, one week or one month old, our customers and prospects say our column store database makes their legacy data warehouse software obsolete. It is many times faster on the same queries. The time needed for complex analyses on extremely large tables has literally been reduced from hours to seconds.
Q8. Are there any similarities in the way large data sets are used in different vertical markets such as financial service, energy & pharmaceuticals?
Simon Garland: The shared feature is that all of our customers have structured, time-series data. The scale of their data problems is completely different, as are their business use cases. The financial services industry, where kdb+ is an industry standard, demands constant improvements to real-time analytics.
Other industries, like pharma, telecom, oil and gas and utilities, have a different concept of time. They also often are working with smaller data extracts, which they often still consider “Big Data.” When data comes in one day, one week or one month after an event occurred, there is not the same sense of real-time decision making as in finance. Having faster results for complex analytics helps all industries innovate and become more responsive to their customers.
Q9. Anything else you wish to add?
Simon Garland: If we piqued your interest, we have a free, 32-bit version of kdb+ available for download on our web site.
Simon Garland, Chief Strategist, Kx Systems
Simon is responsible for upholding Kx’s high standards for technical excellence and customer responsiveness. He also manages Kx’s participation in the Securities Trading Analysis Center, overseeing all third-party benchmarking.
Prior to joining Kx in 2002, Simon worked at a database search engine company.
Before that he worked at Credit Suisse in risk management. Simon has developed software using kdb+ and q, going back to when the original k and kdb were introduced. Simon received his degree in Mathematics from the University of London and is currently based in Europe.
Follow ODBMS.org on Twitter: @odbmsorg
“The future of procurement lies in optimizing cost and managing risk across the entire supplier base; not just the larger suppliers. Easy access to a complete view of supplier relationships across the enterprise will help those responsible for procurement to make favorable decisions, eliminate waste, increase negotiating leverage and manage risk better.”–Shobhit Chugh.
Data Curation, Big Data and the challenges and the future of Procurement/Supply Chain Management are among the topics of the interview with Shobhit Chugh, Product Marketing Lead at Tamr, Inc.
Q1. In your opinion, what is the future of Procurement/Supply Chain Management?
Shobhit Chugh: Procurement spend is one of the largest spend items for most companies; and supplier risk is one of the items that keeps CEOs of manufacturing companies up at night. Just recently, for example, an issue with a haptic device supplier created a shortage of Apple Watches just after the product’s launch.
At the same time, the world is changing: more data sources are available with increasing variety, and that keeps changing with frequent mergers and acquisitions. The future of procurement lies in optimizing cost and managing risk across the entire supplier base; not just the larger suppliers. Easy access to a complete view of supplier relationships across the enterprise will help those responsible for procurement to make favorable decisions, eliminate waste, increase negotiating leverage and manage risk better.
Q2. What are the current key challenges for Procurement/Supply Chain Management?
Shobhit Chugh: Companies looking for efficiency in their supply chains are limited by the siloed nature of procurement. The domain knowledge needed to properly evaluate suppliers typically resides deep in business units and suppliers are managed at ground level, preventing organizations from taking a global view of suppliers across the enterprise. Those people selecting and managing vendors want to drive terms that favor their company, but don’t have reliable cross-enterprise information on suppliers to make those decisions, and the cost of organizing and analyzing the data has been prohibitive.
Q3. What is the impact of Big Data on the Procurement/Supply Chain?
Shobhit Chugh: A brute force, manual effort to get a single view of suppliers on items such as terms, prices, risk metrics, quality, performance, etc. has traditionally been nearly impossible to do cost effectively. Even if the data exists within the organization, data challenges make it hard to consolidate information into a single view across business units. Rule-based approaches for unifying this data have scale limitations and are difficult to enforce given the distributed nature of procurement. And this does not even include the variety of external data sources that companies can take advantage of, which further increases the potential impact of big data.
Big data changes the situation by providing the ability to evaluate supplier contracts and performance in real time, and puts that intelligence in the hands of people working with suppliers so they can make better decisions. Big data holds significant promise, but only when data unification brings the decentralized data and expertise together to serve the greater good.
Q4. Why does this challenge call for data unification?
Shobhit Chugh: The quality of analysis coming out of procurement optimization is directly related to the volume and quality of data going in. Bringing that data together is no minor feat. In our experience, any individual in an organization can effectively use no more than ten percent of the organization’s data even under very good conditions. Given the distributed nature of procurement, that figure is likely dramatically lower in this situation. Cataloging the hundreds or thousands of internal and external data sources related to procurement provides the foundation for improved decision making.
Similarly, the ability to compare data is directly correlated to the ability to match data points in the same category or related to the same supplier. This is where top-down approaches often get bogged down. Part names, supplier names, site IDs and other data attributes need to be normalized and organized. The efficiency of big data is severely limited if like data sets in various formats aren’t brought together for meaningful comparison.
Q5. How is data unification related to Procurement/Supply Chain Management?
Shobhit Chugh: There are several ways for highly trained data scientists to combine a handful of sources for analysis. Procurement optimization across all suppliers is a markedly different challenge. Procurement data for a company could reside in dozens to thousands of places with very little similarity with regard to how the data is organized. Not only is this data hard for a centralized resource to find and collect, it is hard for non-experts to properly organize and prepare for analysis.
This data must be curated so that analysis returns meaningful results.
One thing I want to emphasize is that data unification is an ongoing activity rather than a one-time integration task. Companies that recognize this continue to extract the maximum value out of data, and are also able to bring in more internal and external data sources when the opportunity presents itself.
Q6. Can you put that in the context of a real world example?
Shobhit Chugh: A highly diversified manufacturer we work with wanted a single view of suppliers across numerous information silos spanning multiple business units. A supplier master list would ultimately contain over a hundred thousand supplier records from many ERP systems. Just one business unit was maintaining over a dozen ERP systems, with new ERP systems regularly coming on line or being added through acquisitions. The list of suppliers also changed rapidly, making functions like deduplication nearly impossible to maintain. Additionally, the company wanted to integrate external data to enrich internal data with information on each supplier’s fiscal strength and structure.
A “bottom-up,” probabilistic approach to data integration proved to be more scalable than a traditional “top-down” manual approach, due to the sheer volume and variety of data sources. Specifically, the company leveraged our machine learning algorithms to continuously re-evaluate and remove potential duplicate entries, driving automation supported by expert guidance into a previously manual process performed by non-experts. The initial result was elimination of 33 percent of suppliers from the master list, just through deduplication.
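A bottom-up, score-based approach with experts in the loop can be sketched as follows. This is not Tamr's algorithm: the string similarity from Python's standard `difflib.SequenceMatcher` stands in for a trained model, and the thresholds, suffix list and supplier names are made up for illustration. Pairs that score high are merged automatically, pairs in an uncertain middle band are routed to a subject matter expert, and the rest are left alone.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

AUTO_MERGE = 0.90   # hypothetical threshold: confident enough to merge automatically
ASK_EXPERT = 0.70   # hypothetical threshold: uncertain, route to a subject matter expert

SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(word for word in words if word not in SUFFIXES)


def similarity(a: str, b: str) -> float:
    """Stand-in match score between 0 and 1 for two supplier names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def triage(names):
    """Split candidate pairs into automatic merges and questions for experts."""
    merges, questions = [], []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= AUTO_MERGE:
            merges.append((a, b))
        elif score >= ASK_EXPERT:
            questions.append((a, b))
    return merges, questions


# Made-up supplier records.
suppliers = [
    "Acme Corp.",
    "ACME Corporation",
    "Acme Industries Inc",
    "Acme Industrial Inc.",
    "Zenith Metals LLC",
]
merges, questions = triage(suppliers)
print("auto-merge:", merges)
print("ask expert:", questions)
```

Comparing every pair is quadratic in the number of records, so a list of over a hundred thousand suppliers would need some way of narrowing candidate pairs first; the sketch omits that step.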
The company then looked across multiple businesses’ governance systems for suppliers that were related through a corporate structure and identified a significant overlap. Using the same core master list, operational teams were able to treat supplier subsidiaries as different entities for payment purposes, while analytics teams got a global view of a supplier to ensure consistent payment terms. From hundreds of single-use sources, the company created a single view of suppliers with multiple important uses.
Q7. When you talk about data curation, who is doing the curation and for whom? Is it centralized?
Shobhit Chugh: Everyone responsible for a supplier relationship, and the corresponding data, has an interest in the completeness of the data pool, and an interest in the most complete analysis possible. They don’t have an interest in committing the time required to unify the data manually. Our approach is to use ever-improving machine learning to handle the bulk of data matching and rely on subject matter experts only when needed. Further, the system learns which experts to ask each time help is needed, depending on the situation. Once the data is unified, it is available for use by all, including data scientists and corporate leaders far removed from the front lines.
Q8. Do all data-enabled organizations need to hire the best data scientists they can find?
Shobhit Chugh: Yes, data-driven companies should create data-driven innovation, and non-obvious insights often take good data scientists who are tasked with looking beyond the next supplier for ways data can impact other areas of the business. Here, too, the decentralized model of data unification has dramatic benefits.
The current scarcity of qualified data scientists will only deepen as the growth in demand is expected to far outpace the rate of qualified professionals entering the field. Everyone is looking to hire the best and brightest data scientists to get insights from their data, but relentless hiring is the wrong way to solve the problem. Data scientists spend 80 percent of their time finding and preparing data, and only 20 percent actually finding answers to critical business questions. Therefore, the better path to scaling data scientists is enabling the ones you have to spend more time on analysis rather than data preparation.
Q9. What is the ROI a company could expect from using data unification for procurement?
Shobhit Chugh: Procurement is an exciting area for data unification precisely because once data is unified, value can be derived using existing best practices, now with a much larger percentage of the supplier base.
Value includes better payment terms, cost savings, higher raw material and part quality and lower supplier risk.
Seventy-five to 80 percent of the value of procurement optimization strategies will come from smaller suppliers and contracts, and data unification unlocks this value.
Q10. What do you predict will be the top five challenges for procurement to tackle in the next two years?
Shobhit Chugh: Using data unification and powerful analysis tools, companies will begin to see immediate value from:
• Achieving “most favored” status from suppliers and eliminating poorly structured contracts where suppliers have multiple customers in your organization
• Building holistic relationships with supplier parent organizations based on the full scope of their subsidiaries’ commitments
• Eliminating rules-based approaches to supplier sourcing and other top-down strategies in favor of data-driven, bottom-up strategies that make use of expertise and data spread throughout the organization
• Embracing the variety of pressure points in procurement – price, delivery, quality, minimums, payment terms, risk, etc. – as ways to customize vendor relationships to suit each need rather than a fog that obscures the value of each contract
• Identifying the internal procurement “rock stars” and winning strategies that drive the most value for your organization and replicating those ideas enterprise-wide
Qx. Anything else you wish to add?
Shobhit Chugh: The final component we haven’t discussed is the timing associated with these gains.
We’ve seen procurement optimization projects performed in days or weeks that unleash the vast untapped majority of data locked in previously unknown sources. Not long ago, similar projects focused on just the top suppliers took months and quarters. Addressing the full spectrum of suppliers in this way was not feasible. The combination of data unification and big data is perfectly suited to bringing value quickly and sustaining that value by staying on top of the continual tide of new data.
Shobhit Chugh leads product marketing for Tamr, which empowers organizations to leverage all of their data for analytics by automating the cataloging, connection and curation of “hard-to-reach” data with human-guided machine learning. He has spent his career in tech startups including High Start Group, Lattice Engines, Adaptly and Manhattan Associates. He has also worked as a consultant at McKinsey & Company’s Boston and New York offices, where he advised high tech and financial services clients on technology and sales and marketing strategy.
Shobhit holds an MBA from Kellogg School of Management, a Master’s of Engineering Management in Design from McCormick School of Engineering at Northwestern University, and a Bachelor of Technology in Computer Science from Indian Institute of Technology, Delhi.
– Procurement: Fueling optimization through a simplified, unified view, White Paper, Tamr (link to download, registration required)
– Data Curation at Scale: The Data Tamer System (link to PDF)
ODBMS.org Experts Notes
Selected contributions from ODBMS.org experts panel:
– Big data, big trouble.
– Data Acceleration Architecture/ Agile Analytics.
– Critical Success Factors for Analytical Models.
– Some Recent Research Insights Operations Research as a Data Science Problem.
– Data Wisdom for Data Science.
Follow ODBMS.org on Twitter: @odbmsorg
“What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging” — Charu Aggarwal.
On Data Mining, Data Science and Big Data, I have interviewed Charu Aggarwal, Research Scientist at the IBM T. J. Watson Research Center, an expert in this area.
Q1. You recently edited two books: Data Classification: Algorithms and Applications and Data Clustering: Algorithms and Applications.
What are the main lessons learned in data classification and data clustering that you can share with us?
Charu Aggarwal: The most important lesson, which is perhaps true for all of data mining applications, is that feature extraction, selection and representation are extremely important. It is all too often that we ignore these important aspects of the data mining process.
Q2. How do data classification and data clustering relate to each other?
Charu Aggarwal: Data classification is the supervised version of data clustering. Data clustering is about dividing the data into groups of similar points. In data classification, examples of groups of points are made available to you. Then, for a given test instance, you are supposed to predict which group this point might belong to.
In the latter case, the groups often have a semantic interpretation. For example, the groups might correspond to fraud/not fraud labels in a credit-card application. In many cases, it is natural for the groups in classification to be clustered as well. However, this is not always the case.
Some methods such as semi-supervised clustering/classification leverage the natural connections between these problems to provide better quality results.
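The contrast between the two problems can be made concrete with a small sketch. It is an editorial illustration and not taken from the books discussed: k-means stands in for clustering and a nearest-centroid rule for classification, and all function names are hypothetical.

```python
from math import dist

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, centers, rounds=10):
    """Unsupervised: the groups are discovered from the points alone."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def fit_centroids(labeled):
    """Supervised: groups are given as labels; compute one centroid per label."""
    by_label = {}
    for p, label in labeled:
        by_label.setdefault(label, []).append(p)
    return {label: mean(ps) for label, ps in by_label.items()}

def classify(point, centroids):
    """Predict the label whose centroid is nearest to the test point."""
    return min(centroids, key=lambda label: dist(point, centroids[label]))
```

In the clustering case the caller supplies only points and initial centers; in the classification case the caller supplies labeled examples (for instance fraud / not fraud) and then asks for the label of a new point.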
Q3. Can data classification and data clustering be useful also for large data sets and data streams? If yes, how?
Charu Aggarwal: Data clustering is definitely useful for large data sets, because clusters can be viewed as summaries of the data. In fact, a particular form of fine-grained clustering, referred to as micro-clustering, is commonly used for summarizing high-volume streaming data in real time. These summaries are then used for many different applications, such as first-story detection, novelty detection, prediction, and so on.
In this sense, clustering plays an intermediate role in enabling other applications for large data sets.
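A rough sketch of the micro-clustering idea follows. It assumes a simple rule (absorb a point into the nearest summary if it lies within a fixed distance, otherwise start a new summary); the class name, the threshold parameter, and the absorb rule are illustrative assumptions, not a specific published algorithm.

```python
from math import dist

class MicroCluster:
    """Additive summary of points: count, per-dimension sum and sum of squares."""
    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.square_sum = [x * x for x in point]

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x

def summarize(stream, max_distance):
    """One pass over the stream; raw points are never stored."""
    clusters = []
    for point in stream:
        closest = min(clusters, key=lambda c: dist(point, c.centroid()), default=None)
        if closest is not None and dist(point, closest.centroid()) <= max_distance:
            closest.absorb(point)
        else:
            clusters.append(MicroCluster(point))
    return clusters
```

Because each summary is additive, it can be updated in constant time per point and merged with other summaries later, which is what makes this kind of structure usable as an intermediate step for other applications.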
Classification can also be used to generate different types of summary information, although it is a little less common. The reason is that classification is often used as the end-user application, rather than as an intermediate application like clustering. Therefore, big-data serves as a challenge and as an opportunity for classification.
It serves as a challenge because of obvious computational reasons. It serves as an opportunity because you can build more complex and accurate models with larger data sets without creating a situation where the model inadvertently overfits to the random noise in the data.
Q4. How do you typically extract “information” from Big Data?
Charu Aggarwal: This is a highly application-specific question, and it really depends on what you are looking for. For example, for the same stream of health-care data, you might be looking for different types of information, depending on whether you are trying to detect fraud, or whether you are trying to discover clinical anomalies. At the end of the day, the role of the domain expert can never be discounted.
However, the common theme in all these cases is to create a more compressed, concise, and clean representation into one of the data types we all recognize and know how to process. Of course, this step is required in all data mining applications, and not just big data applications. What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging.
For example, if you look at Google’s original MapReduce framework, it was motivated by a need to efficiently perform operations that are almost trivial for smaller data sets, but suddenly become very expensive in the big-data setting.
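To illustrate the kind of "almost trivial" operation meant here, the following single-process sketch mimics the map, shuffle, and reduce steps for a word count. It is only a toy model of the programming pattern; the function names are hypothetical, and nothing here is distributed.

```python
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) pairs for one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On one machine this is a few lines; the difficulty in the big-data setting comes from running the map and reduce steps on many machines and moving the grouped data between them.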
Q5. What are the typical problems and scenarios when you cluster multimedia, text, biological, categorical, network, streams, and uncertain data?
Charu Aggarwal: The heterogeneity of the data types causes significant challenges.
One problem is that the different data types may often be mixed, as a result of which the existing methods can sometimes not be used directly. Some common scenarios in which such data types arise are photo/music/video-sharing (multimedia), healthcare (time-series streams and biological), and social networks. Among these different data types, the probabilistic (uncertain) data type does not seem to have graduated from academia into industry very well. Of course, it is a new area and there is a lot of active research going on. The picture will become clearer in a few years.
Q6. How effective are today’s clustering algorithms?
Charu Aggarwal: Clustering algorithms have become increasingly effective in recent years because of advances in high-dimensional methods. In the past, when the data was very high-dimensional, most existing methods worked poorly because of locally irrelevant attributes and concentration effects. These are collectively referred to as the curse of dimensionality. Techniques such as subspace and projected clustering have been introduced to discover clusters in lower dimensional views of the data. One nice aspect of this approach is that some variations of it are highly interpretable.
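The concentration effect can be observed with a small experiment. The sketch below is an editorial illustration on uniformly random data (an assumption; real data sets behave differently in detail): it measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import random
from math import dist

def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)
```

Comparing `relative_contrast(2)` with `relative_contrast(500)` shows the contrast shrinking as the dimensionality grows, which is why distance-based clustering degrades on the full space and why lower-dimensional views can help.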
Q7. What is in common between pattern recognition, database analytics, data mining, and machine learning?
Charu Aggarwal: They really do the same thing, which is that of analyzing and gleaning insights from data. It is just that the styles and emphases are different in various communities. Database folks are more concerned about scalability. Pattern recognition and machine learning folks are somewhat more theoretical. The statistical folks tend to use their statistical models. The data mining community is the most recent one, and it was formed to create a common meeting ground for these diverse communities.
The first KDD conference was held in 1995, and we have come a long way since then towards integration. I believe that the KDD conference has played a very major role in the amalgamation of these communities. Today, it is actually possible for the folks from database and machine learning communities to be aware of each other’s work. This was not quite true 20 years ago.
Q8. What are the most “precise” methods in data classification?
Charu Aggarwal: I am sure that you will find experts who are willing to swear by a particular model. However, each model comes with a different set of advantages over different data sets. Furthermore, some models, such as univariate decision trees and rule-based methods, have the advantage of being interpretable even when they are outperformed by other methods. After all, analysts love to know about the “why” aside from the “what.”
While I cannot say which models are the most accurate (highly data specific), I can certainly point to the most “popular” ones today from a research point of view. I would say that SVMs, and neural networks (deep learning) are the most popular classification methods. However, my personal experience has been mixed.
While I have found SVMs to work quite well across a wide variety of settings, neural networks are generally less robust. They can easily overfit to noise or show unstable performance over small ranges of parameters. I am watching the debate over deep learning with some interest to see how it plays out.
Q9. When should one use Mahout for classification, and what is the advantage of using it?
Charu Aggarwal: Apache Mahout is a scalable machine learning environment for data mining applications. One distinguishing feature of Apache Mahout is that it builds on top of distributed infrastructures like MapReduce, and enables easy building of machine learning applications. It includes libraries of various operations and applications.
Therefore, it reduces the effort of the end user beyond the basic MapReduce framework. It should be used in cases where the data is large enough to require the use of such distributed infrastructures.
Q10. What are your favourite success stories in Data Classifications and/or Data Clustering?
Charu Aggarwal: One of my favorite success stories is in the field of high dimensional data, where I explored the effect of locally irrelevant dimensions and concentration effects on various data mining algorithms.
I designed a suite of algorithms for such high-dimensional tasks as clustering, similarity search, and outlier detection.
The algorithms continue to be relevant even today, and we have even generalized some of these results to big-data (streaming) scenarios and other application domains, such as the graph and text domains.
Qx. Anything else you wish to add?
Charu Aggarwal: Data mining and data sciences are at exciting cross-roads today. I have been working in this field since 1995, and I did not see as much excitement about data science in my first 15 years as I have seen in the last 5. This is truly quite amazing!
Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in Yorktown Heights, New York.
He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996.
His research interest during his Ph.D. years was in combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James B. Orlin.
He has since worked in the field of performance analysis, databases, and data mining. He has published over 200 papers in refereed conferences and journals, and has applied for or been granted over 80 patents. He is author or editor of nine books.
Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research.
He has served on the program committees of most major database/data mining conferences, and served as program vice-chair of the SIAM Conference on Data Mining, 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an associate editor of the ACM TKDD Journal, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal.
He is a fellow of the IEEE for “contributions to knowledge discovery and data mining techniques”, and a life-member of the ACM.
– MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
Appeared in: OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
“The IoT will have many, many trillions of connections, particularly considering it’s not just the devices that are connected, but people, organizations, applications, and the underlying network” –Emil Eifrem.
I have interviewed Emil Eifrem, CEO of Neo Technology. Among the topics we discussed: Graph Databases, the new release of Neo4j, and how graphs relate to the Internet of Things.
Q1. Michael Blaha said in an interview: “The key is the distinction between being occurrence-oriented and schema-oriented. For traditional business applications, the schema is known in advance, so there is no need to use a graph database which has weaker enforcement of integrity. If instead, you’re dealing with at best a generic model to which it conforms, then a schema-oriented approach does not provide much. Instead a graph-oriented approach is more natural and easier to develop against.”
What is your take on this?
Emil Eifrem: While graphs do excel where requirements and/or data have an element of uncertainty or unpredictability, many of the gains that companies experience from using graph databases don’t require the schema to be dynamic. What makes graph databases suitable is problems where the relationships between the data, and not just the data, matter.
That said, I agree that a graph-oriented approach is incredibly natural and easier to develop against. We see this again and again, with development cycle times reduced by 90% in some cases. Performance is also a common driver.
Q2. You recently released Neo4j 2.2. What are the main enhancements to the internal architecture you have done in Neo4j 2.2 and why?
Emil Eifrem: Neo4j 2.2 makes huge strides in performance and scalability. Cypher queries run up to 100 times faster than before, thanks to a Cost-Based Optimizer that includes Visual Query Plans as a tuning aid.
Read scaling for highly concurrent workloads can be as much as 10 times higher with the new In-Memory Page Cache, which helps users take better advantage of modern hardware. Write scaling is also significantly higher for highly concurrent transactional workloads. Our engineering team found some clever ways of increasing throughput, by buffering writes to a single transaction log, rather than blocking transactions one at a time where each transaction committed to two transaction logs (graph and index). The last internal architecture change was to integrate a bulk loader into the product. It’s blindingly fast. We use Neo4j internally and a load that took many hours transactionally runs in four minutes with the bulk loader. It operates at throughputs of a million records per second, even for extremely large graphs.
Besides all of the internal improvements, this release also includes a lot of the top-requested developer features from the community in the developer tooling, such as built-in learning and visualization improvements.
Q3. With Neo4j 2.2 you introduce a new page cache. Could you please explain what is new with this page cache?
Emil Eifrem: Neo4j has two levels of caching. In earlier versions, Neo4j delegated the lower level cache to the operating system, by memory mapping files. OS memory mapping is optimized for a wide range of workloads.
As users have continued to push into bigger & bigger workloads, with more and more data, we decided it was time to build a specialized cache, built specially for Neo4j workloads. The page cache uses an LRU-K algorithm, and is auto-configured and statistically optimized, to deliver vastly improved scalability in highly concurrent workloads. The result is much better read scaling in multi-core environments that maintains the ultra-fast performance that’s been the hallmark of Neo4j.
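For readers unfamiliar with LRU-K, the sketch below shows the general eviction rule with K = 2: evict the page whose K-th most recent access lies furthest in the past, treating pages seen fewer than K times as the first candidates. This is a textbook-style illustration with hypothetical names; it makes no claim about how Neo4j's page cache is actually implemented.

```python
from collections import deque
from itertools import count

class LRUKCache:
    """Evicts the page whose k-th most recent access is oldest."""
    def __init__(self, capacity, k=2):
        self.capacity = capacity
        self.k = k
        self.clock = count()   # logical timestamps
        self.history = {}      # page id -> deque of the last k access times

    def access(self, page_id):
        """Record an access; return the evicted page id, or None."""
        evicted = None
        if page_id not in self.history:
            if len(self.history) >= self.capacity:
                evicted = min(self.history, key=self._priority)
                del self.history[evicted]
            self.history[page_id] = deque(maxlen=self.k)
        self.history[page_id].append(next(self.clock))
        return evicted

    def _priority(self, page_id):
        times = self.history[page_id]
        # Pages with fewer than k accesses sort first (treated as infinitely old);
        # ties fall back to the oldest retained access time.
        kth_recent = times[0] if len(times) == self.k else -1
        return (kth_recent, times[0])
```

The practical difference from plain LRU is that a page touched once by a large scan does not push out a page that is accessed repeatedly.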
Q4. How is this new page cache helping to overcome some of the limitations imposed by current IO systems? Do you have any performance measurements to share with us?
Emil Eifrem: The benefits kick in progressively as you add cores and threads. In the labs we’ve seen up to 10 times higher read throughput compared to previous versions of Neo4j, in large simulations. We also have some very positive reports from the field indicating similar gains.
Q5. What enhancements did you introduce in Neo4j 2.2 to improve both transactional and batch write performance?
Emil Eifrem: Write throughput has gone up because of two improvements. One is the fast-write buffering architecture. This lets multiple transactions flush to disk at the same time, in a way that improves throughput without sacrificing latency. Secondly, there is a change to the structure of the transaction logs. Prior to 2.2, writes used to be committed one at a time with two-phase commit for both the graph and its index. With the unified transaction log, multiple writes can be committed together, using a more efficient approach than before, for ensuring ACIDity between the graph and indexes.
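The general idea of committing several writes together can be sketched as a group commit: transactions append to a shared buffer, and one flush persists the whole batch to a single log. The class below is a simplified, single-threaded illustration with hypothetical names and a fixed batch size (an assumption); it is not Neo4j's implementation.

```python
class GroupCommitLog:
    """Buffers commit records and persists them in batches to one log."""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []   # commits waiting for the next flush
        self.log = []       # stands in for the on-disk transaction log
        self.flushes = 0    # number of (expensive) disk syncs performed

    def commit(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.log.extend(self.pending)  # one sequential write for the batch
            self.pending.clear()
            self.flushes += 1
```

With a batch size of 4, ten commits cost three flushes instead of ten, which is where the throughput gain in this kind of design comes from.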
For bulk initial loading, there’s something entirely different, a utility called “neo4j-import” that’s designed to load data at extremely high rates. We’ve seen complex graphs with tens of billions of nodes and relationships loading at rates of 1M records per second.
Q6. You introduced a cost-based query planner for Cypher, which uses statistics about data sets. How does it work? What statistics do you use about data sets?
Emil Eifrem: In 2.2 we introduced both a cost-based optimizer and a visual query planner for Cypher.
The cost-based optimizer gathers statistics such as the total number of nodes by label and calculates the most efficient query path based not just on information about the question being asked, but information about the data patterns in the graph. While some Cypher read queries perform just as fast as they did before, others can be 100 times faster.
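As a much simplified illustration of using label counts, the sketch below picks the starting point of a pattern match by choosing the label with the fewest nodes. The function name, the inputs, and the "fewest rows first" rule are assumptions made for the example; a real optimizer weighs many more factors.

```python
def cheapest_start(pattern_labels, label_counts, total_nodes):
    """Pick the label in the pattern with the fewest nodes to scan first."""
    def estimated_rows(label):
        # Labels without statistics fall back to a full-scan estimate.
        return label_counts.get(label, total_nodes)
    return min(pattern_labels, key=estimated_rows)
```

For a pattern that connects `Person` and `City` nodes, starting from the few hundred cities and expanding outward touches far fewer records than starting from a million people, even though both plans return the same result.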
The visual query planner provides insight into how the Neo4j optimizer will execute a query, helping users write better and faster queries because Cypher is more transparent.
Q7. Gartner recently said that “Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
Graph analysis does not necessarily imply the need for a dedicated graph database. Do you have any comment?
Emil Eifrem: Gartner is making a business statement, not a technology statement. Graph analysis refers to a business activity. The best tool we know for solving relevant and valuable graph analysis problems is graph databases.
The real value of using a dedicated graph database is in the power to use data relationships easily. Because data and its relationships are stored and processed as they naturally occur in a graph database, elements such as index-free adjacency lead to ultra-accurate and speedy responses to even the most complex queries.
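Index-free adjacency means that each node holds direct references to its neighbors, so a traversal follows those references instead of consulting a global index at every hop. A minimal in-memory sketch, with hypothetical class and function names, looks like this:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []   # direct references, no lookup structure in between

    def connect(self, other):
        self.neighbors.append(other)
        other.neighbors.append(self)

def friends_of_friends(node):
    """Two-hop traversal that only follows references held by each node."""
    found = set()
    for friend in node.neighbors:
        for fof in friend.neighbors:
            if fof is not node and fof not in node.neighbors:
                found.add(fof.name)
    return found
```

The cost of each hop depends on the number of relationships of the node being visited rather than on the total size of the graph.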
Businesses that build operational applications on the right graph database experience measurable benefits: better performance overall, more competitive applications that incorporate previously impossible-to-include real-time features, easier development cycles that lead to faster time-to-market, and higher revenues thanks to speedier innovation and sharper fraud detection.
Q8. How do you position Neo4j with respect to RDBMS which handles XML and RDF data, and to NoSQL databases which handle graph-based data?
Emil Eifrem: While it is possible to use Neo4j to model an RDF-style graph, our observation is that most people who have tried doing this have found RDF much more difficult to learn and use than the property graph model and associated query methods. This is unsurprising given that RDF is a web standard created by an organization chartered with world wide web standards (the W3C), which has a very different set of requirements than organizations do for their enterprise databases. We saw the need to invent a model suited for persistent data inside of an enterprise, for use as a database data model.
As for XML, that again is a great data transport and exchange mechanism, and is conceptually similar to what’s done in document databases. But it’s not really a suitable model for database storage if what you care about is relating things across your network. XML databases experienced some hype early on but never caught on.
While we’re talking about document databases … there’s another point here worth drilling into, which is not the data model, but the consistency model. If you’re dealing with isolated documents, then it’s okay for the scope of the transaction to be limited to one object. This means eventual consistency is okay. With graphs, because things relate to one another, if you don’t ensure that related things get written to the database on an “all or nothing” basis, then you can very quickly corrupt your graph. This is why BASE is sufficient for other forms of NoSQL, but not for graphs.
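The "all or nothing" requirement can be illustrated with a toy graph that validates a batch of changes before making any of them visible. The class and method names are hypothetical, and a real database achieves this with transaction logs and locking rather than set copies.

```python
class Graph:
    def __init__(self):
        self.nodes = set()
        self.edges = set()

    def apply(self, new_nodes, new_edges):
        """Apply all changes or none: validate on a copy, then swap it in."""
        nodes = self.nodes | set(new_nodes)
        for source, target in new_edges:
            if source not in nodes or target not in nodes:
                raise ValueError(f"dangling relationship {source}->{target}")
        self.nodes = nodes
        self.edges = self.edges | set(new_edges)
```

If a relationship in the batch points at a node that does not exist, the whole batch is rejected and the graph is left exactly as it was, so no half-written relationship can corrupt it.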
Q9. Could you please give us some examples of how graph databases could help support the Internet of Things (IoT)?
Emil Eifrem: We love Neo4j for the Internet of Things, but we’d like to see it renamed to Internet of Connected Things! After all, the value is in the connections between all of the things, that is, the connections and interactions between the devices.
Two points are worth remembering:
- Devices in isolation bring little value to the IoT; rather, it’s the connections between devices that truly bring forth the latent possibilities.
- We’re not just speaking about tracking billions of connections; the IoT will have many, many trillions of connections, particularly considering it’s not just the devices that are connected, but people, organizations, applications, and the underlying network.
Understanding and managing these connections will be at least as important for businesses as understanding and managing the devices themselves. Imagination is key to unlocking the value of connected things. For example, in a telecommunications or aviation network, the questions, “What cell tower is experiencing problems?” and “Which plane will arrive late?” can be answered much more accurately by understanding how the individual components are connected and impact one another. Understanding connections is also key to understanding dependencies and uncovering cascading impacts.
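A cascading-impact question of this kind reduces to a graph traversal. The sketch below assumes the connections are available as a mapping from each component to the components that depend on it (the data layout and names are illustrative) and walks it breadth-first from the failed component.

```python
from collections import deque

def impacted_by(failed, depends_on_me):
    """Breadth-first walk over 'is depended on by' edges from a failed component."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in depends_on_me.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {failed}
```

Given a failing cell tower, the result is every router and service that is directly or indirectly affected, which cannot be derived from looking at the devices in isolation.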
Q10. What are your top 3 favourite case studies for Neo4j?
Emil Eifrem: My top three use cases are:
- Real-time Recommendations – Personalize product, content and service offers by leveraging data relationships. (dynamic pricing, financial services products, online retail, routing & networks)
- Fraud Detection – Improve existing fraud detection methods by uncovering hidden relationships to discover fraud rings and indirections. (financial services, health care, government, gaming)
- Master Data Management – Improve business outcome through storage and retrieval of complex and ‘hierarchical’ master data. Top MDM data sets across our customer base include: customer (360 degree view), organizational hierarchy, employee (HR), product / product line management, metadata management / data governance, and CMDB, as well as digital assets. (financial services, telecommunications, insurance, agribusiness)
Q11. Anything else you wish to add?
Emil Eifrem: Yes, one thing. As much as we’re a product company, we are very passionate educators and evangelists. Our mission is to help the world make sense of data. We discovered that graphs are an amazing way of doing that, and we’re working hard to share that with the world.
For anyone interested in learning more, our web site offers a lot of great learning resources: talks, examples, free training… we’ve even worked with O’Reilly to offer up their Graph Databases e-book. By the time this article is published, the second edition should be up on http://graphdatabases.com. Any of your readers who are interested, are welcome to come and learn, and become part of the amazing & rapidly growing worldwide graph database community.
It’s been great to speak with you!
Emil is the founder of the Neo4j open source graph database project, the most widely deployed graph database in the world. As the CEO of Neo4j’s commercial sponsor, Neo Technology, Emil spreads the word about the powers of graphs everywhere. Emil is a co-author of the O’Reilly Media book “Graph Databases” and presents regularly at conferences around the world such as JAOO, JavaOne, QCon, and OSCON.
“Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.”–Krishna Gade.
I have interviewed Krishna Gade, Engineering Manager on the Data team at Pinterest.
Q1. What are the main challenges you are currently facing when dealing with data at Pinterest?
Krishna Gade: Pinterest is a data product and a data-driven company. Most of our Pinner-facing features like recommendations, search and Related Pins are created by processing large amounts of data every day. Added to this, we use data to derive insights and make decisions on products and features to build and ship. As Pinterest usage grows, the number of Pinners, Pins and the related metadata are growing rapidly. Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.
On the product side, we’re curating a unique dataset we call the ‘interest graph’ which captures the relationships between Pinners, Pins, boards (collections of Pins) and topic categories. As Pins are visual bookmarks of web pages saved by our Pinners, we can have the same web page Pinned many different times. One of the problems we try to solve is to collate all the Pins that belong to the same web page and aggregate all the metadata associated with them.
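As a rough illustration of the collation problem, the sketch below groups Pins by a normalized form of the URL they point at. The normalization rule (lowercase host, drop scheme, query string, fragment, and trailing slash) is an assumption chosen for the example; it is not a description of Pinterest's actual pipeline.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def canonical(url):
    """Rough canonical form: lowercase host, no query, fragment, or trailing slash."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")

def collate(pins):
    """Group (pin_id, url) pairs by the page they point at."""
    pages = defaultdict(list)
    for pin_id, url in pins:
        pages[canonical(url)].append(pin_id)
    return dict(pages)
```

Once Pins are grouped per page, metadata such as descriptions or board names can be aggregated over each group.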
Visual discovery is an important feature in our product. When you click on a Pin we need to show you visually related Pins. In order to do this we extract features from the Pin image and apply sophisticated deep learning techniques to suggest Pins related to the original. There is a need to build scalable infrastructure and algorithms to mine and extract value from this data and apply to our features like search, recommendations etc.
Q2. You wrote in one of your blog posts that “data-driven decision making is in your company DNA”. Could you please elaborate and explain what you mean by that?
Krishna Gade: It starts from the top. Our senior leadership is constantly looking for insights from data to make critical decisions. Every day, we look at the various product metrics computed by our daily pipelines to measure how the numerous product features are doing. Every change to our product is first tested with a small fraction of Pinners as an A/B experiment, and at any given time we’re running hundreds of these A/B experiments. Over time data-driven decision making has become an integral part of our culture.
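Exposing a change to a small fraction of users is commonly done with deterministic hashing, so that the same user always lands in the same group. The sketch below shows that general technique; the function name and the choice of SHA-256 are assumptions for illustration and do not describe Pinterest's experiment framework.

```python
import hashlib

def in_experiment(user_id, experiment, fraction):
    """Deterministically place roughly `fraction` of users in the treatment group."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < fraction
```

Including the experiment name in the hash keeps assignments independent across experiments, which matters when hundreds of them run at the same time.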
Q3. Specifically, what do you use Real-time analytics for at Pinterest?
Krishna Gade: We build batch pipelines extensively throughout the company to process billions of Pins and the activity on them. These pipelines allow us to process vast amounts of historic data very efficiently and tune and personalize features like search, recommendations, home feed etc. However these pipelines don’t capture the activity happening currently – new users signing up, millions of repins, clicks and searches. If we only rely on batch pipelines, we won’t know much about a new user, Pin or trend for a day or two. We use real-time analytics to bridge this gap.
Our real-time data pipelines process the user activity stream, which includes various actions taken by the Pinner (repins, searches, clicks, etc.) as they happen on the site, compute signals for Pinners and Pins in near real-time and make these available back to our applications to customize and personalize our products.
Q4. Could you please give us an overview of the data platforms you use at Pinterest?
Krishna Gade: We’ve used existing open-source technologies and also built custom data infrastructure to collect, process and store our data. We built a logging agent, Singer, which is deployed on all of our web servers and constantly pumps log data into Kafka, which we use as a log transport system. After the logs reach Kafka, they’re copied into Amazon S3 by our custom log persistence service called Secor. We built Secor to ensure zero data loss and overcome the weak eventual consistency model of S3.
After this point, our self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing. All our large scale batch pipelines run on Hadoop, which is the core data infrastructure we depend on for improving and observing our product. Our engineers use either Hive or Cascading to build the data pipelines, which are managed by Pinball – a flexible workflow management system we built. More recently, we’ve started using Spark to support our machine learning use-cases.
Q5. You have built a real-time data pipeline to ingest data into MemSQL using Spark Streaming. Why?
Krishna Gade: As of today, most of our analytics happens in the batch processing world. All the business metrics we compute are powered by the nightly workflows running on Hadoop. In the future our goal is to be able to consume real-time insights to move quickly and make product and business decisions faster. A key piece of infrastructure missing for us to achieve this goal was a real-time analytics database that can support SQL.
We wanted to experiment with a real-time analytics database like MemSQL to see how it works for our needs. As part of this experiment, we built a demo pipeline to ingest all our repin activity stream into MemSQL and built a visualization to show the repins coming from the various cities in the U.S.
Q6. Could you please give us some detail on how it is implemented?
Krishna Gade: As Pinners interact with the product, Singer agents hosted on our web servers are constantly writing the activity data to Kafka. The data in Kafka is consumed by a Spark streaming job. In this job, each Pin is filtered and then enriched by adding geolocation and Pin category information. The enriched data is then persisted to MemSQL using MemSQL’s spark connector and is made available for query serving. The goal of this prototype was to test if MemSQL could enable our analysts to use familiar SQL to explore the real-time data and derive interesting insights.
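The filter-and-enrich stage described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Pinterest's code: the event fields (`action`, `pin_id`, `ip`), the two lookup tables, and the function names are hypothetical, and the real job runs in Spark Streaming and persists its output to MemSQL through the MemSQL Spark connector rather than building a list in memory.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical lookup tables standing in for real geolocation and category services.
GEO_BY_IP: Dict[str, str] = {"10.0.0.1": "San Francisco", "10.0.0.2": "Austin"}
CATEGORY_BY_PIN: Dict[int, str] = {101: "food", 102: "travel"}


def is_repin(event: dict) -> bool:
    """Keep only repin events; every other action is dropped."""
    return event.get("action") == "repin"


def enrich(event: dict) -> dict:
    """Return a copy of the event with a city and a Pin category attached."""
    enriched = dict(event)
    enriched["city"] = GEO_BY_IP.get(event.get("ip"), "unknown")
    enriched["category"] = CATEGORY_BY_PIN.get(event.get("pin_id"), "unknown")
    return enriched


def process(events: Iterable[dict]) -> Iterator[dict]:
    """Filter, then enrich, mirroring the two stages of the streaming job."""
    for event in events:
        if is_repin(event):
            yield enrich(event)


# Hypothetical sample of the activity stream.
events = [
    {"action": "repin", "pin_id": 101, "ip": "10.0.0.1"},
    {"action": "click", "pin_id": 102, "ip": "10.0.0.2"},
    {"action": "repin", "pin_id": 999, "ip": "10.0.0.9"},
]
rows = list(process(events))
```

Each element of `rows` is a flat record, which is the shape a SQL table needs; in the prototype these records would be written to MemSQL so analysts can query them with ordinary SQL.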
Q7. Why did you choose MemSQL and Spark for this? What were the alternatives?
Krishna Gade: I led the Storm engineering team at Twitter, and we were able to scale the technology for hundreds of applications there. During that time I was able to experience both good and bad aspects of Storm.
When I came to Pinterest, I saw that we were beginning to use Storm but mostly for use-cases like computing the success rate and latency stats for the site. More recently we built an event counting service using Storm and HBase for all of our Pin and user activity. In the long run, we think it would be great to consolidate our data infrastructure to a smaller set of technologies. Since we’re already using Spark for machine learning, we thought of exploring its streaming capabilities. This was the main motivation behind using Spark for this project.
As for MemSQL, we were looking for a relational database that can run SQL queries on streaming data that would not only simplify our pipeline code but would give our analysts a familiar interface (SQL) to ask questions on this new data source. Another attractive feature about MemSQL is that it can also be used for the OLTP use case, so we can potentially have the same pipeline enabling both product insights and user-facing features. Apart from MemSQL, we’re also looking at alternatives like VoltDB and Apache Phoenix. Since we already use HBase as a distributed key-value store for a number of use-cases, Apache Phoenix, which is nothing but a SQL layer on top of HBase, is interesting to us.
Q8. What are the lessons learned so far in using such a real-time data pipeline?
Krishna Gade: It’s early days for the Spark + MemSQL real-time data pipeline, so we’re still learning about the pipeline and ingesting more and more data. Our hope is that in the next few weeks we can scale this pipeline to handle hundreds of thousands of events per second and have our analysts query them in real-time using SQL.
Q9. What are your plans and goals for this year?
Krishna Gade: On the platform side, our plan is to scale real-time analytics in a big way at Pinterest. We want to be able to refresh our internal company metrics and the signals fed into product features at the granularity of seconds instead of hours. We’re also working on scaling our Hadoop infrastructure, especially looking into preventing S3 eventual consistency from disrupting the stability of our pipelines. This year should also see more open-sourcing from us. We started the year by open-sourcing Pinball, our workflow manager for Hadoop jobs. We plan to open-source Singer, our logging agent, sometime soon.
On the product side, one of our big goals is to scale our self-serve ads product and grow our international user-base. We’re focusing especially on markets like Japan and Europe to grow our user-base and get more local content into our index.
Qx. Anything else you wish to add?
Krishna Gade: For those who are interested in more information, we share the latest from the engineering team on our Engineering blog. You can follow along with the blog, as well as updates on our Facebook Page. Thanks a lot for the opportunity to talk about Pinterest engineering and some of the data infrastructure challenges.
Krishna Gade is the engineering manager for the data team at Pinterest. His team builds core data infrastructure to enable data driven products and insights for Pinterest. They work on some of the cutting edge big data technologies like Kafka, Hadoop, Spark, Redshift etc. Before Pinterest, Krishna was at Twitter and Microsoft building large scale search and data platforms.
–Singer, Pinterest’s Logging Infrastructure (LINK to SlideShares)
–Introducing Pinterest Secor (LINK to Pinterest engineering blog)
–MemSQL’s spark connector (memsql/memsql-spark-connector GitHub)
Follow ODBMS.org on Twitter: @odbmsorg
“Predictive analytics is a market which has been lagging the growth of big data – full of tools developed twenty or more years ago which simply weren’t built with today’s challenges in mind.”–Walter Maguire and Indrajit Roy
HP announced HP Distributed R. I wanted to learn more about it, and I have interviewed Walter Maguire, Chief Field Technologist with the HP Big Data Group, and Indrajit Roy, principal researcher at HP, who provided the answers with the assistance of Malu G. Castellanos, manager and technical contributor in the Vertica group of Hewlett Packard.
Q1. HP announced HP Distributed R. What is the difference with the standard R?
Maguire, Roy: R is a very popular statistical analysis tool. But it was conceived before the era of Big Data. It is single threaded and cannot analyze massive datasets. HP Distributed R brings scalability and high performance to R users. Distributed R is not a competing version of R. Rather, it is an open source package that can be installed on vanilla R. Once installed, R users can leverage the pre-built distributed algorithms and the Distributed R API to benefit from cluster computing and dramatically expand the scale of the data they are able to analyze.
Q2. How does HP Distributed R work?
Maguire, Roy: HP Distributed R has three components:
(1) an open source distributed runtime that executes R functions,
(2) a fast, parallel data loader to ingest data from different sources such as the Vertica database, and
(3) a mechanism to deploy the model in the Vertica database.
The distributed runtime is the core of HP Distributed R.
It starts multiple R workers on the cluster, breaks the user’s program into multiple independent tasks, and executes them in parallel on the cluster. The runtime hides much of the internal data communication. For example, the user does not need to know how many machines make up the cluster and where data resides in the cluster. In essence, it allows any R algorithm which has been ported to use distributed R to act like a massively parallel system.
Q3. Could you tell us some details on how users write Distributed R programs to benefit from scalability and high-performance?
Maguire, Roy: A programmer can use HP Distributed R’s API to write distributed applications. The API consists of two types of language constructs. First, the API provides distributed data-structures. These are really distributed versions of R’s common data structures such as array, data.frame, and list. As an example, distributed arrays can store 100s of gigabytes of data in-memory and across a cluster. Second, the API also provides a way for users to express parallel tasks on distributed data structures. While R users can write their own custom distributed applications using this API, we expect most R users to be interested in built-in algorithms. Just like R has built-in packages such as kmeans for clustering and glm for regression, HP Distributed R provides distributed versions of common clustering, classification, and graph algorithms.
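The two kinds of constructs described above, a partitioned data structure plus a way to run tasks on its partitions, can be illustrated with a toy sketch. It is written in Python rather than R and does not use or imitate the actual Distributed R API: the `PartitionedArray` class and its `map_partitions` method are hypothetical names, and threads on one machine stand in for R workers on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


class PartitionedArray:
    """Toy stand-in for a distributed array: values are split into partitions."""

    def __init__(self, values: List[float], num_partitions: int) -> None:
        # Ceiling division so that every value lands in some partition.
        size = max(1, -(-len(values) // num_partitions))
        self.partitions = [values[i:i + size] for i in range(0, len(values), size)]

    def map_partitions(self, task: Callable[[List[float]], float]) -> List[float]:
        """Run one independent task per partition; results keep partition order."""
        with ThreadPoolExecutor() as pool:
            return list(pool.map(task, self.partitions))


# Ten values spread over three partitions; each partition is summed on its own.
data = PartitionedArray([float(x) for x in range(1, 11)], num_partitions=3)
partial_sums = data.map_partitions(sum)
total = sum(partial_sums)
```

The caller never addresses an individual worker: it names a task and the structure decides where each piece runs, which is the property the answer attributes to the distributed runtime.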
Q4. R has already many packages that provide parallelism constructs. How do they fit into Distributed R?
Maguire, Roy: Yes, R has a number of open source parallel packages. Unfortunately, none of the packages can handle hundreds or thousands of gigabytes of data or has built-in distributed data structures and computational algorithms. HP Distributed R fills that functionality gap, along with enterprise support – which is critical for customers before they deploy R in production systems.
Also, it’s worth noting that using distributed R doesn’t prevent an R programmer from using their current libraries in their current environment. Those libraries just won’t gain the scale and performance benefits of distributed R.
Q5. Why is there a need to streamline language constructs in order to move R forward in the era of Big Data?
Maguire, Roy: The open source community has done a tremendous job of advancing R—different algorithms, thousands of packages, and a great user community.
However, in the case of parallelism and Big Data there is a confusing mix of R extensions. These packages have overlapping functionality, in many cases completely different syntax, and none of them solve all the issues users face with Big Data. We need to ensure that future R contributors can use a standard set of interfaces and write applications that are portable across different backend packages. This is not just our concern, but something that members of R-core and other companies are interested in as well. Our goal is to help the open source community streamline some of the language constructs so they can spend more time answering analytic questions and less time trying to make sense of the different R extensions.
Q6. What are in your opinion the strengths and weaknesses of the current R parallelism constructs?
Maguire, Roy: Some packages such as “parallel” are very useful. In the case of “parallel”, it is accessible to most R users, already ships with R, and it is easy to express embarrassingly parallel applications (those in which individual tasks don’t need to coordinate with each other). Still, parallel and other packages lack concepts such as distributed data-structures which can provide the much needed performance on massive data. Additionally, it is not clear if the infrastructure implementing existing parallel constructs has been tested on large, multi-gigabyte data.
Q7. When are MPI and R wrappers around MPI a good option?
Maguire, Roy: MPI is a powerful tool. It is widely used in the scientific and high performance computing domain.
If you have an existing MPI application and want to expose it to R users, the right thing is to make it available through R wrappers. It does not make sense to rewrite these optimized scientific applications in R or any other language.
Q8. Why can adding some form of distributed objects in R potentially improve performance for in-memory processing?
Maguire, Roy: In-memory processing represents a big change moving forward. The key idea is to remove bottlenecks such as the disk which slows down applications. In HP Distributed R, distributed objects provide a way to store and manipulate data in-memory. Without these distributed objects, data on worker nodes will be ephemeral and users will not be able to reference remote data. Worse, there will be performance issues. For example, many machine learning applications are iterative and need to execute tasks for multiple rounds. Without the concept of distributed objects, applications would end up re-broadcasting data to remote servers in each round. This results in a lot of data movement and very poor performance. Incidentally, this is a good example of why we undertook Distributed R in the first place. Implementing the bare bones of a parallel application is relatively straightforward, but there are thousands or tens of thousands of edge cases which arise once said application is in use due to the nature of distributed processing.
This is when the value of a cohesive parallel framework like Distributed R becomes very apparent.
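The cost of re-broadcasting can be made concrete with a small back-of-the-envelope calculation. The figures below (a 100 GB dataset and 20 rounds) are assumptions chosen for illustration, not measurements from Distributed R, and the sketch ignores the comparatively small traffic for model parameters exchanged in each round.

```python
def gigabytes_moved(dataset_gb: float, rounds: int, persistent: bool) -> float:
    """Gigabytes of input data shipped to workers by an iterative algorithm."""
    if persistent:
        # Distributed objects: data is loaded once and referenced in later rounds.
        return dataset_gb
    # No distributed objects: the data is re-sent to the workers every round.
    return dataset_gb * rounds


# Assumed workload: 100 GB of training data, 20 iterations.
rebroadcast_gb = gigabytes_moved(100.0, rounds=20, persistent=False)
persistent_gb = gigabytes_moved(100.0, rounds=20, persistent=True)
```

Under these assumptions the re-broadcasting version moves 2,000 GB against 100 GB, and the gap grows linearly with the number of rounds.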
Q9. Do you think that using simple parallelism constructs, such as lapply, that operate on distributed data structures may make it easier to program in R?
Maguire, Roy: Yes, we need to ensure that R users have a simple API to express parallelism. Implementing machine learning algorithms requires deep knowledge. Couple it with parallelism, and you are left with a very small set of people who can really write such applications. To ensure that R users continue to contribute, we need an API which is familiar to current R users. Constructs from the apply() family are a good choice. In fact we are exploring these kinds of APIs with members of R-core.
Q10. R is an open-source software project. What about HP Distributed R?
Maguire, Roy: Just like R, HP Distributed R is a GPL licensed open source project. Our code is available on GitHub and we try to release a new version every few months. We provide enterprise support for customers who need it. If you have HP Vertica enterprise edition you will see additional benefits of integrating Vertica with Distributed R.
For example, you can build a machine learning model in Distributed R, and then deploy it in Vertica to score data real time in an analytic application – something many of our customers need.
Qx. Anything you wish to add?
Maguire, Roy: Predictive analytics is a market which has been lagging the growth of big data – full of tools developed twenty or more years ago which simply weren’t built with today’s challenges in mind.
With HP Distributed R we are not only providing users with scalable and high performance solutions, but also making a difference in the open source community. We look forward to nurturing contributors who can straddle the world of data science and distributed systems.
A core tenet of our big data strategy is to create a positive developer experience, and we are very focused on technology development and fulfillment choices which support that goal.
Walter Maguire has twenty-eight years of experience in analytics and data technologies.
He practiced data science before it had a name, worked with big data when “big” meant a megabyte, and has been part of the movement which has brought data management and analytic technologies from back-office, skunk works operations to core competencies for the largest companies in the world. He has worked as a practitioner as well as a vendor, working with analytics technologies ranging from SAS and R to data technologies such as Hadoop, RDBMS and MPP databases. Today, as Chief Field Technologist with the HP Big Data Group, Walt has the unique pleasure of addressing strategic customer needs with Haven, the HP big data platform.
Indrajit Roy is a principal researcher at HP. His research focusses on next generation distributed systems that solve the challenges of Big Data. Indrajit’s pet project is HP Distributed R, a new open source product that helps data scientists. Indrajit has multiple patents, publications and a best paper award. In the past he worked on computer security and parallel programming. Indrajit received his PhD in computer science from the University of Texas at Austin.
- Download HP Distributed R V1.0
- HP Distributed R Data Sheet
- Online Documentation
- HP Distributed R Source on GitHub
- Big Data Predictive Analytics Survey Infographic
- Distributed Machine Learning and Graph Processing with Sparse Matrices
- Using R for Iterative and Incremental Processing
“Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.”–Dr. Uwe Lammers.
“The Gaia mission is considered to be the largest data processing challenge in astronomy.”–Vik Nagjee
In December of 2013, the European Space Agency (ESA) launched a satellite called Gaia on a five-year mission to map the galaxy and learn about its past.
The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”.
I recall here the Objectives of the Gaia Project (source ESA Web site):
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
I have been following the Gaia mission since 2011, and I have reported on it in two interviews so far. This is the third interview of the series, and the first one after the launch.
The interview is with Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency, and Vik Nagjee, Product Manager for Data Platforms at InterSystems.
Q1. Could you please elaborate in some detail what is the goal and what are the expected results of the Gaia mission?
Uwe Lammers: We are trying to construct the most consistent, most complete and most accurate astronomical catalog ever done. Completeness means to observe all objects in the sky that are brighter than a so-called magnitude limit of 20. These are mostly stars in our Milky Way up to 1.5 billion in number. In addition, we expect to observe as many as 10 million other galaxies, hundreds of thousands of celestial bodies in our solar system (mostly asteroids), tens of thousands of new exo-planets, and more. Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.
Vik Nagjee: The data collected from Gaia will ultimately result in a three-dimensional map of the Milky Way, plotting over a billion celestial objects at a distance of up to 30,000 light years. This will reveal the composition, formation and evolution of the Galaxy, and will enable the testing of Albert Einstein’s Theory of Relativity, the space-time continuum, and gravitational waves, among other things. As such, the Gaia mission is considered to be the largest data processing challenge in astronomy.
Orbiting the Lagrange 2 (L2) point, a fixed spot 1.5 million kilometers from Earth, Gaia will measure the position, movement, and brightness of more than a billion celestial objects, looking at each one an average of 70 times over the course of five years. Gaia’s measurements will be much more complete, powerful, and accurate than anything that has been done before. ESA scientists estimate that Gaia will find hundreds of thousands of new celestial objects, including extra-solar planets, and the failed stars known as brown dwarfs. In addition, because Gaia can so accurately measure the position and movement of the stars, it will provide valuable information about the galaxy’s past – and future – evolution.
Read more about the Gaia mission here.
Q2. What is the size and structure of the information you analysed so far?
Uwe Lammers: From the start of the nominal mission on 25 July until today, we have received about 13 terabytes of compressed binary telemetry from the satellite. The daily pipeline running here at the Science Operations Centre (SOC) has processed all this and generated about 48 TB of higher-level data products for downstream systems.
At the end of the mission, the Main Database (MDB) is expected to hold more than 1 petabyte of data. The structure of the data is complex and this is one of the main challenges of the project. Our data model contains about 1,500 tables with thousands of fields in total, and many inter-dependencies. The final catalog to be released sometime around 2020 will have a simpler structure, and there will be ways to access and work with it in a convenient form, of course.
Q3. Since the launch of Gaia in December 2013, what intermediate results did you obtain by analysing the data received so far?
Uwe Lammers: Last year we found our first supernova (exploding star) with the prototype of the so-called Science Alert pipeline. When this system is fully operational, we expect to find several of these per day. The recent detection of a micro-lensing event was another nice demonstration of Gaia’s capabilities.
Q4. Did you find out any unexpected information and/or confirmation of theories by analysing the data generated by Gaia so far?
Uwe Lammers: It is still too early in the mission to prove or disprove established astronomical theories. For that we need to collect more data and do much more processing. The daily SOC pipeline is only one, the first part, of a large distributed system that involves five other Data Processing Centres (DPCs), each running complex scientific algorithms on the data. The whole system is designed to improve the results iteratively, step by step, until the final accuracy has been reached. However, there will certainly be intermediate results. One simple example of an unexpected early finding is that Gaia gets hit by micro-meteoroids much more often than pre-launch estimates predicted.
Q5. Could you please explain at a high level Gaia’s data pipeline?
Uwe Lammers: Hmmm, that’s not easy to do in a few words. The daily pipeline at the SOC converts compact binary telemetry of the satellite into higher level products for the downstream systems at the SOC and the other processing centres. This sounds simple, but it is not – mainly because of the complex dependencies and the fact that data does not arrive from the satellite in strict time order. The output of the daily pipeline is only the start as mentioned above.
From the SOC, data gets sent out daily to the other DPCs, which perform more specialized processing. After a number of months we declare the current data segment as closed, receive the outputs from the other DPCs back at the SOC, and integrate all into a coherent next version of the MDB. The creation of it marks the end of the current iteration and the start of a new one. This cyclic processing will go on for as many iterations as needed to converge to a final result.
An important key process is the Astrometric Global Iterative Solution (AGIS), which will give us the astrometric part of the catalog. As the name suggests, it is in itself an iterative process and we run it likewise here at the SOC.
Vik Nagjee: To add on to what Dr. Lammers describes, Gaia data processing is handled by a pan-European collaboration, the Gaia Data Processing and Analysis Consortium (DPAC), and consists of about 450 scientists and engineers from across Europe. The DPAC is organized into nine Coordination Units (CUs); each CU is responsible for a specific portion of the Gaia data processing challenge.
One of the CUs – CU3: Core Processing – is responsible for unpacking, decompressing, and processing the science data retrieved from the satellite to provide rapid monitoring and feedback of the spacecraft and payload performances at the ultra-precise accuracy levels targeted by the mission. In other words, CU3 is responsible for ensuring the accuracy of the data collected by Gaia, as it is being collected, to ensure the accuracy of the eventual 3-D catalog of the Milky Way.
Over its lifetime, Gaia will generate between 500,000 and 1 million GB of data. On an average day, approximately 50 million objects will “transit” Gaia’s field of view, resulting in about 285 GB of data. When Gaia is surveying a densely populated portion of the galaxy, the daily amount could be 7 to 10 times as much, climbing to over 2,000 GB of data in a day.
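The figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the numbers given in the answer (285 GB on an average day, a five-year mission, a factor of 7 to 10 in dense regions); treating a year as 365 days is a simplifying assumption.

```python
AVG_GB_PER_DAY = 285   # average daily volume quoted in the interview
MISSION_YEARS = 5      # nominal mission length
DAYS_PER_YEAR = 365    # assumption: leap days ignored

# Lifetime volume if every day were an average day.
total_gb = AVG_GB_PER_DAY * DAYS_PER_YEAR * MISSION_YEARS

# Daily volume when surveying dense regions (7 to 10 times the average).
peak_low_gb = 7 * AVG_GB_PER_DAY
peak_high_gb = 10 * AVG_GB_PER_DAY
```

This gives about 520,000 GB over five years, which sits at the lower end of the quoted range, and roughly 2,000 to 2,850 GB on a dense day, consistent with "over 2,000 GB".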
There is an eight-hour window of time each day when raw data from Gaia is downloaded to one of three ground stations.
The telemetry is sent to the European Space Astronomy Centre (ESAC) in Spain – the home of CU3: Core Processing – where the data is ingested and staged.
The initial data treatment converts the data into the complex astrometric data models required for further computation. These astrometric objects are then sent to various other Coordination Units, each of which is responsible for looking at different aspects of the data. Eventually the processed data will be combined into a comprehensive catalog that will be made available to astronomers around the world.
In addition to performing the initial data treatment, ESAC also processes the resulting astrometric data with some complex algorithms to take a “first-look” at the data, making sure that Gaia is operating correctly and sending back good information. This processing occurs on the Initial Data Treatment / First Look (IDT/FL) Database; the data platform for the IDT/FL database is InterSystems Caché.
Q6. Observations made and conclusions drawn are only as good as the data that supports them. How do you evaluate the “quality” of the data you receive? And how do you discard the “noise” from the valuable information?
Uwe Lammers: A very good question! If you refer to the final catalog, this is a non-trivial problem and a whole dedicated group of people is working on it. The main issue is, of course, that we do not know the “true” values as in simulations. We work with models, e.g., models of the stars’ positions and the satellite orientation. With those we can predict the observations, and the difference between the predicted and the observed values tells us how well our models represent reality. We can also do consistency checks. For instance, we do two runs of AGIS, one with only the observations from odd months and another one from even months, and both must give similar results. But we will also make use of external astronomical knowledge to validate results, e.g., known distances to particular stars. For distinguishing “noise” from “signal,” we have implemented robust outlier rejection schemes. The quality of the data coming directly from the satellite and from the daily pipeline is assessed with a special system called First Look running also at the SOC.
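Two of the ideas in this answer, outlier rejection and the odd-month versus even-month consistency check, can be sketched on toy data. The interview does not say which rejection scheme Gaia uses; the sketch below swaps in a common generic one, the modified z-score based on the median and the median absolute deviation (MAD), with the conventional constants 0.6745 and 3.5. The observation values and function names are made up for illustration, and a simple mean stands in for a full AGIS run.

```python
from statistics import mean, median
from typing import List, Tuple


def reject_outliers(values: List[float], threshold: float = 3.5) -> List[float]:
    """Keep values whose modified z-score (median/MAD based) is within threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return list(values)  # no spread to judge against; keep everything
    return [v for v in values if 0.6745 * abs(v - center) / mad <= threshold]


def split_half_difference(observations: List[Tuple[int, float]]) -> float:
    """Absolute difference between estimates from odd and even months."""
    odd = [value for month, value in observations if month % 2 == 1]
    even = [value for month, value in observations if month % 2 == 0]
    return abs(mean(reject_outliers(odd)) - mean(reject_outliers(even)))


# Hypothetical (month, measurement) pairs; month 7 contains a gross outlier.
observations = [
    (1, 10.0), (2, 10.2), (3, 9.9), (4, 10.1), (5, 10.1),
    (6, 9.8), (7, 50.0), (8, 10.0), (9, 10.0), (10, 9.9),
]
difference = split_half_difference(observations)
```

A small `difference` means the two independent halves agree, which is the kind of evidence the consistency check is looking for; a large one would point to a problem in the data or in the model.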
Vik Nagjee: The CU3: Core Processing Unit is responsible for ensuring the accuracy of the data being collected by Gaia, as it is being collected, so as to ensure the accuracy of the eventual 3-D catalog of the Milky Way.
InterSystems Caché is the data platform used by CU3 to quickly determine that Gaia is working properly and that the data being downloaded is trustworthy. Caché was chosen for this task because of its proven ability to rapidly ingest large amounts of data, populate extremely complex astrometric data models, and instantly make the data available for just-in-time analytics using SQL, NoSQL, and object paradigms.
One million GB of data easily qualifies as Big Data. What makes InterSystems Caché unique is not so much its ability to handle very large quantities of data, but its abilities to provide just-in-time analytics on just the right data.
We call this “Big Slice” — which is where analytics is performed just-in-time for a focused result.
A good analogy is how customer service benefits from occasional Big Data analytics. Breakthrough customer service comes from improving service at the point of service, one customer at a time, based on just-in-time processing of a Big Slice – the data relevant to the customer and her interactions. Back to the Gaia mission: at the conclusion of five years of data collection, a true Big Data exercise will plot the solar map. Yet, frequently ensuring data accuracy is an example of the increasing strategic need for our “Big Slice” concept.
Q7. What kind of databases and analytics tools do you use for Gaia’s data pipeline?
Uwe Lammers: At the SOC all systems use InterSystems’ Caché database. Despite some initial hiccups, Caché has proved to be a good choice for us. For analytics we use a few popular generic astronomical tools (e.g., topcat), but most are custom-made and specific to Gaia data. All DPCs had originally used relational databases, but some have migrated to Apache Hadoop.
Q8. Specifically for the Initial Data Treatment/First Look (IDT/FL) database, what are the main data management challenges you have?
Uwe Lammers: The biggest challenge is clearly the data volumes and the steady incoming stream that will not stop for the next five years. The satellite sends us 40-100 GB of compressed raw data every day, which the daily pipeline needs to process and store the output in near real time, as otherwise we quickly accumulate backlogs.
This means all components, the hardware, databases, and software, have to run and work robustly more or less around the clock. The IDT/FL database grows daily by a few hundred gigabytes, but not all data has to be kept forever. There is an automatic cleanup process running that deletes data that falls out of chosen retention periods. Keeping all this machinery running around the clock is tough!
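A retention-based cleanup of the kind mentioned above can be sketched as follows. The product names, retention periods, and record layout are hypothetical, since the interview gives none of these details, and a real implementation would issue deletes against the database rather than filter a Python list.

```python
from datetime import datetime, timedelta
from typing import List

# Hypothetical retention periods per data product.
RETENTION = {
    "raw_telemetry": timedelta(days=30),
    "first_look": timedelta(days=90),
}


def expired(records: List[dict], now: datetime) -> List[dict]:
    """Return the records that have outlived their product's retention period."""
    return [r for r in records if now - r["created"] > RETENTION[r["product"]]]


# Hypothetical records, checked against a fixed reference time.
now = datetime(2015, 1, 1)
records = [
    {"id": 1, "product": "raw_telemetry", "created": datetime(2014, 11, 15)},
    {"id": 2, "product": "raw_telemetry", "created": datetime(2014, 12, 20)},
    {"id": 3, "product": "first_look", "created": datetime(2014, 11, 15)},
]
to_delete = [r["id"] for r in expired(records, now)]
```

Keeping the periods in one table per product type makes the policy easy to review and change without touching the cleanup logic.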
Vik Nagjee: Gaia’s data pipeline imposes some rather stringent requirements on the data platform used for the Initial Data Treatment/First Look (IDT/FL) database. The technology must be capable of ingesting a large amount of data and converting it into complex objects very quickly. In addition, the data needs to be immediately accessible for just-in-time analytics using SQL.
ESAC initially attempted to use traditional relational technology for the IDT/FL database, but soon discovered that a traditional RDBMS couldn’t ingest discrete objects quickly enough. To achieve the required insert rate, the data would have to be ingested as large BLOBs of approximately 50,000 objects, which would make further analysis extremely difficult. In particular, the first look process, which requires rapid, just-in-time analytics of the discrete astrometric data, would be untenable. Another drawback to using traditional relational technology, in addition to the typical performance and scalability challenges, was the high cost of the hardware that would be needed.
Since traditional RDBMS technology couldn’t meet the stringent demands imposed by CU3, ESAC decided to use InterSystems Caché.
Q9. How did you solve such challenges and what lessons did you learn until now?
Uwe Lammers: I have a good team of talented and very motivated people and this is certainly one aspect.
In case of problems we are also totally dependent on quick response times from the hardware vendors, the software developers and InterSystems. This has worked well in the past, and InterSystems’ excellent support in all cases where the database was involved is much appreciated. As far as the software is concerned, the clear lesson is that rigorous validation testing is essential – the more the better. There can never be too much. As a general lesson, one of my favorite quotes from Einstein captures it well: “Everything should be made as simple as possible, but no simpler.”
Q10. What has been the usefulness of CU3’s IDT/FL database for the Gaia mission so far?
Uwe Lammers: It is indispensable. It is the central working repository of all input/output data for the daily pipeline including the important health monitoring of the satellite.
Vik Nagjee: The usefulness of CU3’s IDT/FL database was proven early in Gaia’s mission. During the commissioning period for the satellite, an initial look at the data it was generating showed that extraneous light was being gathered. If the situation couldn’t be corrected, the extra light could significantly degrade Gaia’s ability to see and measure faint objects.
It was hypothesized that water vapor from the satellite outgassed in the vacuum of space, and refroze on Gaia’s mirrors, refracting light into its focal plane. Although this phenomenon was anticipated (and the mirrors equipped with heaters for that very reason), the amount of ice deposited was more than expected. Heating the mirrors melted the ice and solved the problem.
Scientists continue to rely on the IDT/FL database to provide just-in-time feedback about the efficacy and reliability of the data they receive from Gaia.
Qx Anything else you wish to add?
Uwe Lammers: Gaia is by far the most interesting and challenging project I have ever worked on.
It is fascinating to see science, technology, and a large diverse group of people working together trying to create something truly great and lasting. Please all stay tuned for exciting results from Gaia to come!
Vik Nagjee: As Dr. Lammers said, Gaia is truly one of the most interesting and challenging computing projects of all time. I’m honored to have been a contributor to this project, and cannot wait to see the results from the Gaia catalog. Here’s to unraveling the chemical and dynamical history of our Galaxy!
Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency.
Uwe Lammers has a PhD in Physics and a degree in Computer Science and has been working for the European Space Agency on a number of space science missions for the past 20 years. After being involved in the X-ray missions EXOSAT, BeppoSAX, and XMM-Newton, Gaia caught his attention in 2004.
As of late 2005, together with William O’Mullane, he built up the Gaia Science Operations Centre (SOC) at ESAC near Madrid. From early 2006 to mid-2014 he was in charge of the development of AGIS and is now leading the SOC as Gaia Science Operations Manager.
Vik Nagjee is a Product Manager for Data Platforms at InterSystems.
He’s responsible for Performance and Scalability of InterSystems Caché, and spends the rest of his time helping people (prospects, application partners, end users, etc.) find perfect solutions for their data, processing, and system architecture needs.