ODBMS.org (Operational Database Management Systems)

Meet Greenplum 5: The World’s First Open-Source, Multi-Cloud Data Platform Built for Advanced Analytics
Thu, 21 Sep 2017 | http://www.odbms.org/2017/09/meet-greenplum-5-the-worlds-first-open-source-multi-cloud-data-platform-built-for-advanced-analytics/


The largest and most innovative organizations in the world have deployed Pivotal Greenplum, the leading massively parallel analytical data platform, to help solve their most strategic analytical challenges, from fraud management and risk analysis to cybersecurity and IoT. These and other important analytical workloads are technically impossible or cost-prohibitive to run on traditional data platforms. In 2015, Pivotal shook up the data warehouse and analytics industry by taking Greenplum open source.

Today we’re thrilled to announce the latest innovation to the most powerful, agile, and mission-critical data platform for advanced analytics: Pivotal Greenplum 5. This major release centers on three significant new capabilities and improvements:

  • Multi-Cloud Deployment. Greenplum 5 is now certified and available on Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), VMware vSphere, and OpenStack, in addition to the currently supported on-premises options. Pivotal also offers deployment assistance and managed services on all these platforms.
  • Integrated Analytics. Greenplum 5 eliminates analytical silos by providing a single scale-out environment for next-generation advanced analytics (machine learning, graph, text, geospatial) as well as traditional (BI/reporting) workloads.
  • Fast Development of Analytical Innovations. Open source community innovations combined with Pivotal Engineering’s agile development practices mean faster delivery of analytical innovation for customers and the community.

Multi-Cloud Data Analytics

Run your analytics anywhere you need them

Support for analytics in multi-cloud environments is an important requirement for many organizations in 2017.

A major reason is that organizations are adopting the cloud incrementally, on a project-by-project basis. Often, different groups within the enterprise want the flexibility to instantiate and shut down their own analytical environments in Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), or private clouds. They want the freedom to select the best cloud platform for each project and workload based on ease of use, performance, and total cost of ownership. Just as important, organizations want the elasticity and disaster recovery capabilities that multiple cloud environments enable. The present and future of analytics is multi-cloud.

Unlike both legacy enterprise data warehouses (EDWs) and new “cloud” data warehouses, all Greenplum platform optimizations are made in the software and not on proprietary hardware and/or network configurations. This makes Greenplum 5 a flexible yet powerful, infrastructure-agnostic platform able to run anywhere you need it, including:

  • All public clouds: AWS, Azure, and GCP, with Bring Your Own License (BYOL) and hourly offerings
  • Private clouds: VMware vSphere and OpenStack
  • On Premises (Dedicated Hardware): Dell EMC DCA appliances, Dell EMC Blueprints, HP, and Cisco certified configurations, and customer-supplied hardware

An infrastructure-agnostic analytics platform such as Greenplum 5 offers a number of benefits when you are deciding where to run it:

  • Helps avoid cloud/hardware vendor lock-in, enabling your organization to leverage the best available infrastructure at competitive prices.
  • Provides cloud adoption flexibility by enabling organizations to migrate designated analytical workloads to the cloud, while retaining others on-premises due to business, governance, or other requirements.
  • Eases the deployment of the best and most appropriate infrastructure for each project or independent environment (ETL, model building, testing, scoring, BI), helping your analytical users (ETL developers, data scientists, analysts) stay productive and focused on the needs of the business.
  • Allows for quickly instantiating new clusters in minutes when running on the AWS or Azure Marketplaces, with no impact on existing environments.

Integrated Analytics: ML, Graph, GeoSpatial and More

One platform for all compute-intensive and complex analytical needs

Before the explosion of new data sources, the EDW was the best place from which to provide as close to a 360-degree analytical view of the business as possible. In recent years, many organizations have deployed disparate analytics alternatives to the EDW in an attempt to glean more sophisticated insights from their data. These alternatives include:

  • Cloud data warehouses
  • Machine learning frameworks
  • Graph databases
  • Geospatial tools
  • Text analytics environments

Often these new deployments have resulted in the creation of analytical silos that are too complex to integrate with existing EDWs, thus significantly limiting enterprise-wide insights and innovation.

Unlike the traditional EDW and newer alternatives, Greenplum 5 eliminates data silos by integrating traditional and advanced analytics in one scale-out analytical platform. Here are some of the interfaces and operators integrated in Greenplum 5:

  • Open Source, Parallel Machine Learning, and Graph Analytics: Apache MADlib is an open source library for scalable and parallel analytics. It provides data-parallel implementations of machine learning, mathematical, statistical, and graph methods on Greenplum 5. MADlib uses the full compute power of Greenplum’s massively parallel processing (MPP) architecture to process very large data sets, whereas other products are limited by the amount of data that can be loaded into memory on a single node. MADlib algorithms can also be invoked from a familiar SQL interface, so they are easy to create and use.
  • Open Source, Parallel GeoSpatial Analytics: Unlike the proprietary geospatial capabilities available in some EDWs, Greenplum 5 provides massively scalable geospatial analytics based on the PostGIS open source project. Pivotal takes full advantage of the vibrant PostGIS community and partner ecosystem to constantly deliver GIS innovations.
  • Parallel Text Analytics: Pivotal Greenplum 5 users have access to GPText, an Apache Solr-powered text analytics engine that is optimized for Greenplum’s MPP architecture. GPText 2.0 takes the flexibility and configurability of Solr and merges it with the scalability and easy SQL interface of Greenplum, dramatically simplifying and speeding up the time to insight for massive quantities of raw text data, including semi-structured and unstructured data (social media feeds, email databases, documents, etc.).
  • Support for Popular Python and R Analytical Libraries through Procedural Language Extensions (PL/X): Greenplum 5 allows users to write user-defined functions (UDFs) in a wide range of languages including SQL, Perl, Python, R, C, and Java, and supports the parallelized and distributed execution of these UDFs in data science workflows. Furthermore, Greenplum users can leverage functions from any of the add-on packages of these languages (e.g., NLTK for Python, rstan for R) in these UDFs. Greenplum 5 also provides easy-to-use installers for the most popular add-on libraries for Python and R.
  • Support for Spark with the Greenplum-Spark Connector (GSC): The new GSC provides Spark users, such as data scientists, with a native connection to Pivotal Greenplum 5. GSC allows users to load data at high speed from Greenplum into Spark and to run workloads on the Spark cluster. Result sets from computation on the Spark cluster can then be pushed back into Greenplum for further analysis and persistent storage.
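The MPP pattern these operators rely on can be sketched in plain Python: each segment computes a small partial state over its slice of the data, and the master combines the partials into a global result. This is an illustration of the idea only; the function names are invented, not Greenplum APIs.

```python
def segment_partials(rows):
    """Per-segment pass: a single scan with constant-size state."""
    n = len(rows)
    s = sum(rows)
    ss = sum(x * x for x in rows)
    return n, s, ss

def combine(partials):
    """Master combines the partial states into global mean/variance."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    variance = ss / n - mean * mean  # population variance
    return mean, variance

# Data split across three hypothetical segments.
segments = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean, var = combine([segment_partials(seg) for seg in segments])
```

This combine-partials shape is what lets a data-parallel library process data sets far larger than a single node’s memory: no segment ever needs more than its own slice plus a few running totals.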


Greenplum 5 and its integrated analytical operators enable users to operationalize analytical models at scale and ship tangible business innovation in record time. For example:

  • Machine learning in the database at-scale provides data science and analytics teams with a platform for rapidly responding to new business opportunities and challenges. Model training can be done at-scale in the database on-demand. Model scoring may be operationalized on the platform or models can be exported to run elsewhere including in modern data microservices architectures running on a Platform-as-a-Service (PaaS) such as Pivotal Cloud Foundry®.
  • The ability to process, analyze, and search on multi-structured text documents using modern libraries (Python) and operators (Apache Solr) combined with machine learning, provides the ideal platform for assessing a wide variety of multi-structured content.
  • For customers with Geographical Information Systems (GIS) requirements (e.g. retailers, banks, federal government), Greenplum 5 offers the ability to combine GeoSpatial analytics with machine learning. For example, a large retailer can easily understand how customers use different store locations, anticipate which stores will see an increased demand for particular items, and forecast changing markets, all leading to improved customer satisfaction and increased revenue. By providing these capabilities in the analytic data platform, analysis can be done at scale, thereby avoiding the risk and effort of sampling.
  • Data scientists can use the tools with which they are comfortable, including Python and R, that process and analyze data at-scale without requiring data movement.
  • SQL-based, data platform integrated analytics deliver faster time to market for building and deploying data science models.

Fast Development of Analytical Innovations

100% Commitment to Open Source: Fast Innovation working with the PostgreSQL Community

In Greenplum 5, we merged 3,000+ PostgreSQL improvements into the Greenplum core and added new PostgreSQL capabilities in many areas, including performance, support for JSON and HSTORE for semi-structured data, native support for additional data types such as Universally Unique Identifiers (UUIDs), and a raster geospatial module for advanced geospatial analysis.

Beyond fast delivery of new capabilities, aligning PostgreSQL and Greenplum Database open source communities gives our customers a strategic advantage as they are in control of the software they deploy, without vendor lock-in, while allowing open influence on product direction.

Agile Development: Constant Delivery of New Analytical Capabilities in Greenplum

For more than three years, the Pivotal Greenplum engineering team has followed Pivotal’s agile development practices (small, focused teams, pair programming, test-driven development, and continuous integration). This has dramatically increased the pace of innovation, with new releases of the platform landing on a monthly basis, far outpacing both traditional open source and proprietary alternatives. There is no other analytical platform on the planet delivering innovation at the velocity of Pivotal Greenplum.

 Greenplum 5 Supporting Quotes

Pivotal Greenplum Customer

“We used Greenplum running on AWS to build an advertising solution that’s really changing our industry. We are very excited about the multi-cloud capabilities and the new analytics that Greenplum 5 brings to the table and hope to continue our close partnership with Pivotal.”

John Conley, Vice President Data Warehousing, Conversant.

Learn more about how Conversant is using Greenplum.



“Innovation is alive at Greenplum. The data platform continues to thrive for use cases involving petabyte-scale data sets requiring the service levels and concurrency of a proven SQL engine at open source prices.”

Tony Baer, Principal Analyst, Information Management, Ovum


“Pivotal’s 5th version of the Greenplum Data Platform allows our customers to feel confident that the critical analytics needed to run their businesses will continue to grow in capabilities, without fear of vendor lock-in and in the spirit of open source. It’s a major release that has drawn tremendous interest from many of our most innovative and demanding customers.”

Dan Feldhusen, President, ZData, An Atos Business


“Pivotal Greenplum 5.0 is a huge step forward. It’s the most performant version yet; it runs wherever you need it to; and it provides an incredible set of analytic capabilities to power both business intelligence and machine learning. With this release, Greenplum is more than a data warehouse, it’s a data platform.”

Elisabeth Hendrickson, Vice President of Data R&D, Pivotal

For more information

About the Author

Cesar Rojas serves as the Head of Product Marketing for Pivotal Greenplum, responsible for setting the messaging and go-to-market strategy for Greenplum. Prior to joining Pivotal, Mr. Rojas was Director of Product Marketing for the Teradata Portfolio for Hadoop and Teradata Aster offerings. Mr. Rojas is an advanced analytics and data management veteran with 15 years of experience working for the largest data analytics vendors as well as successful data startups. Mr. Rojas has an MBA with an emphasis in eBusiness from Notre Dame de Namur University, as well as a bachelor’s degree in Computer Engineering.

Originally published here.

HOW TO ACHIEVE 1.5 MILLION OPS/SECOND WITH REDIS
Wed, 20 Sep 2017 | http://www.odbms.org/2017/09/how-to-achieve-1-5-million-opssecond-with-redis/




In this Ask a Redis Expert™ webinar, Redis Labs’ Chief Developer Advocate Itamar Haber explains how to measure, monitor, make sense of, and maximize Redis performance.

The phrase “lightning fast” takes on a different meaning with Redis: not only can it get you to 1.5 million operations/sec with a single EC2 instance, it can help you achieve this with an arsenal of data structures and commands that deliver in-database analytics.

Slide deck: http://www.slideshare.net/itamarhaber…
List of external references: https://gist.github.com/itamarhaber/2…

Sponsored by Redis Labs

Big Data World, Singapore, 11-12 October 2017
Tue, 19 Sep 2017 | http://www.odbms.org/2017/09/big-data-world-singapore-11-12-october-2017/

Big Data World, Singapore 2017

11-12 October 2017 | Marina Bay Sands, Singapore | 9.30am – 5pm | www.bigdataworldasia.com


Big Data World is designed to help data and business professionals shape their big data strategies.

Join us this October at Marina Bay Sands, Singapore for the most important gathering of big data decision makers and influencers in Asia. Get access to the world’s best big data expertise, a world-class conference programme and a host of exciting event features:

  • Source from 300 leading providers and solution leaders including SAP, Huawei, Fujitsu, Docker, Ashnik, Gigamon, QNAP, KDDI, Athena Dynamics, Infor, Riverbed, Dynatrace, Veeam, Infortrend, NetApp, ServiceNow and many more.
  • Be inspired by some 350 prominent industry experts from blue-chip companies, leading organisations, service providers and innovative SMEs including Airbnb, DBS Bank, Zalora Group, Standard Chartered Bank, Life.SREDA, The Hong Kong Polytechnic University and many more – all speaking in a compelling conference and seminar programme, which covers all the major technology and business issues.
  • Network with thousands of your peers, industry visionaries, leaders and people who have faced – and overcome – the same challenges as you.
  • Visit our industry-leading sister events for FREE on the same ticket access – Cloud Expo Asia, Cloud & Cyber Security Expo, Smart IoT and Data Centre World, Singapore providing comprehensive solutions for professionals in one location.

Do not miss the opportunity to be part of this innovative event, where you can truly contextualise, understand and employ data to your advantage. Register now for your FREE ticket: http://www.cloudexpoasia.com/datadriven

Major Machine Learning Investment in Kx Technology
Tue, 19 Sep 2017 | http://www.odbms.org/2017/09/major-machine-learning-investment-in-kx-technology/

19 September 2017

First Derivatives plc

(“FD” or the “Group”)

Major Machine Learning Investment in Kx Technology

FD (AIM:FDP.L, ESM:FDP.I) announces a range of initiatives to put machine learning (ML) capabilities at the heart of future development for the Group’s Kx technology, in direct response to increasing interest from current and potential customers. The measures announced today will accelerate delivery of pipeline opportunities in software and consulting and provide access to an increased pool of ML specialists to help increase traction in this rapidly growing area.

Machine learning is an application of artificial intelligence (AI) that allows computer systems to use algorithms that adapt based on data, rather than explicit programming, to enable better outcomes. Kx technology, incorporating the market-leading in-memory time-series database kdb+, is able to rapidly process vast quantities of data using fewer computing resources than competing technologies, and is therefore ideally placed to enable adoption of ML across multiple industries and use cases. IDC estimates that the market for ML-related technology will increase from $12.5 billion in 2017 to more than $46 billion in 2020.

The Group has already received considerable interest from existing and potential customers in areas such as capital markets, IIoT, retail and digital marketing, with a view to harnessing the power of Kx for ML purposes. The Group is in the process of recruiting a team of ML experts with extensive commercial experience of implementing AI solutions in finance and other industries, who have worked with teams including DeepMind.

 These ML experts will be supplemented by additional Kx senior technical resources to exploit this exciting commercial opportunity. It is expected that, as part of the development effort, interfaces will be created to enable Kx to accelerate processing and deliver real-time capabilities to ML applications developed using other technologies. The initiative will be led by Mark Sykes, a member of the Group’s executive committee.

 To meet the expected demand for ML and AI consultancy the Group has signed an agreement with Brainpool, a specialist consultancy with 130 ML engineers working across commercial and academic institutions. These specialists have domain expertise in a variety of industries. Brainpool’s consultants will receive training in the core Kx technology and will be able to work as part of Kx project teams assembled to deliver the benefits of ML to the Group’s customers.

 Peter Bebbington, Chief Technical Officer of Brainpool, commented: “Machine learning is a key part of the drive to introduce artificial intelligence into enterprises to deliver automation, maximise efficiencies and generate value. Our agreement with FD will support the proliferation of Kx, an important enabling technology, to enable this transformation.”

Brian Conlon, Chief Executive Officer of Kx, commented: “The interest from current and potential customers in using Kx for machine learning reinforces our belief that our technology has a major enabling role to play in supporting the development of ML and AI technology. The measures we have announced today will allow Kx to power real-time, mission critical ML applications and support our drive to position Kx across multiple industries.”


For further information please contact:

First Derivatives plc


Brian Conlon, Chief Executive Officer

Graham Ferguson, Chief Financial Officer

Ian Mitchell, Head of Investor Relations


+44(0)28 3025 2242


Investec Bank plc (Nominated Adviser and Broker)

Carlton Nelson

Sebastian Lawrence

+44 (0)20 7597 4000

Goodbody (ESM Adviser and Broker) 

Linda Hickey

Finbarr Griffin

+353 1 667 0420

FTI Consulting

Matt Dixon

Dwight Burden

Darius Alexander

Niamh Fogarty

+44 (0)20 3727 1000

About Kx

Kx is a division of FD, a global technology provider with 20 years of experience working with some of the world’s largest finance, technology, retail, pharma, manufacturing and energy institutions. Kx technology, incorporating the kdb+ time-series database, is a leader in high-performance, in-memory computing, streaming analytics and operational intelligence. Kx delivers the best possible performance and flexibility for high-volume, data-intensive analytics and applications across multiple industries. The Group operates from 14 offices across Europe, North America and Asia Pacific, including its headquarters in Newry, and employs more than 1,800 people worldwide.

For further information, please visit www.firstderivatives.com

On “Data Quality”
Sun, 17 Sep 2017 | http://www.odbms.org/2017/09/on-data-quality/

I have interviewed a number of Data Scientists and asked them questions on Data Quality.

I have listed their replies below. Perhaps they will be useful for your work: take what you think is relevant for you and leave the rest. And if you wish, you can quote some of them in your publications (all interviews are listed with the relevant links below).


Jeff Saltz: http://www.odbms.org/2017/08/qa-with-data-scientists-jeff-saltz/

Q. How do you ensure data quality?

Data quality is a subset of the larger challenge of ensuring that the results of the analysis are accurate or described in an accurate way. This covers the quality of the data, what one did to improve the data quality (e.g., removing records with missing data) and the algorithms used (e.g., whether the analytics were appropriate). In addition, it includes ensuring an accurate explanation of the analytics to the client of the analytics. As you can see, I think of data quality as an integrated aspect of an end-to-end process (i.e., not a “check” done before one releases the results).
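The “removing records with missing data” step he mentions can be as small as the following sketch; the record fields here are invented purely for illustration.

```python
# Drop records with any missing (None) field before analysis.
records = [
    {"user": "a", "age": 34, "spend": 120.0},
    {"user": "b", "age": None, "spend": 80.0},   # missing age
    {"user": "c", "age": 29, "spend": None},     # missing spend
    {"user": "d", "age": 41, "spend": 200.0},
]

complete = [r for r in records if all(v is not None for v in r.values())]
dropped = len(records) - len(complete)
```

In the end-to-end view he describes, the count of dropped records is itself part of the result and should be reported to the client alongside the analysis.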

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

With respect to being relevant, this should be addressed by our first topic of discussion: needing domain knowledge. It is the domain expert (either the data scientist or a different person) who is best positioned to determine the relevance of the results. However, evaluating if the analysis is “good” or “correct” is much more difficult, and relates to our previous data quality discussion. It is one thing to try to do “good” analytics, but how does one evaluate if the analytics are “good” or “relevant”? I think this is an area ripe for future research. Today, there are various methods that I (and most others) use. While the actual techniques vary based on the data and analytics used, ensuring accurate results ranges from testing new algorithms with known data sets to point-sampling results to ensure reasonable outcomes.

Yanpei Chen: http://www.odbms.org/2017/08/qa-with-data-scientists-yanpei-chen/

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant”?

Here’s a list of things I watch for:

• Proxy measurement bias. If the data is an accidental or indirect measurement, it may differ from the “real” behavior in some material way.

• Instrumentation coverage bias. The “visible universe” may differ from the “whole universe” in some systematic way.

• Analysis confirmation bias. Often the data will generate a signal for “the outcome that you look for”. It is important to check whether the signals for other outcomes are stronger.

• Data quality. If the data contains many NULL values, invalid values, duplicated data, missing data, or if different aspects of the data are not self-consistent, then the weight placed in the analysis should be appropriately moderated and communicated.

• Confirmation of well-known behavior. The data should reflect behavior that is common and well-known. For example, credit card transaction volumes should peak around well-known times of the year. If not, conclusions drawn from the data should be questioned.

My view is that we should always view data and analysis with a healthy amount of skepticism, while acknowledging that many real-life decisions need only directional guidance from the data.
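The data-quality item in the list above (quantifying NULLs and duplicates so that the weight placed in the analysis can be moderated and communicated) might look like this minimal sketch; the sample rows are invented.

```python
def quality_report(rows):
    """Summarize NULL and duplicate rates for a batch of records."""
    total = len(rows)
    nulls = sum(1 for r in rows if None in r)       # rows with any NULL field
    duplicates = total - len(set(rows))             # exact duplicate rows
    return {
        "rows": total,
        "null_rate": nulls / total,
        "duplicate_rate": duplicates / total,
    }

rows = [(1, "a"), (2, None), (2, None), (3, "c")]
report = quality_report(rows)
```

Reporting these rates alongside the analysis, rather than silently cleaning, is what lets the reader apply the healthy skepticism the author recommends.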

Manohar Swamynathan: http://www.odbms.org/2017/05/qa-with-data-scientists-manohar-swamynathan/

Q. How do you ensure data quality?

Looking at basic statistics (central tendency and dispersion) about the data can give good insight into the data quality. You can perform univariate and multivariate analysis to understand the trends and the relationships within and between variables. Summarizing the data is a fundamental technique to help you understand the data quality and the issues/gaps. A figure in the original interview maps the tabular and graphical data summarization methods to the different data types; the mapping covers the obvious or commonly used methods and is not an exhaustive list.
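A minimal version of the central-tendency-and-dispersion check described above, using Python's standard library; the sample values are invented.

```python
import statistics

values = [12, 15, 14, 10, 200, 13, 11]  # one suspicious value

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.pstdev(values),   # population standard deviation
}

# A mean far above the median hints at outliers worth investigating.
skew_flag = summary["mean"] > 2 * summary["median"]
```

Even this tiny summary surfaces the quality issue: the mean (about 39.3) is pulled far above the median (13) by the single extreme value.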

Q. How do you know when the data sets you are analyzing are “large enough” to be significant?

Don’t just collect a large pile of historic data from all sources and throw it at your big data engine. Note that many things might have changed over time, such as business processes, operating conditions, operating models, and systems/tools. So be cautious: the historic training dataset you consider for model building should be large enough to capture the trends/patterns that are relevant to the current business problem, otherwise your model might be misleading. Consider the example of a forecasting model, which usually has three components: seasonality, trend, and cycle. If you are building a model that uses an external weather factor as one of the independent variables, note that some parts of the USA have seen comparatively extreme winters post-2015; however, you do not know whether this trend will continue. In this case you would require a minimum of two years of data to confirm that the seasonality trend repeats, but to be more confident in the trend you can look at up to 5 or 6 years of historic data; anything beyond that might not be representative of current trends.

Jonathan Ortiz: http://www.odbms.org/2017/04/qa-with-data-scientists-jonathan-ortiz/

Q. How do you ensure data quality?

The world is a messy place, and, therefore, so is the web and so is data. No matter what you do, there is always going to be dirty data: records lacking attributes entirely, missing values within attributes, and inaccuracies throughout. The best way to alleviate this is for all data users to track the provenance of their data and allow for reproducibility of their analyses and models. The open-source software development philosophy will be co-opted by data scientists as more and more of them collaborate on data projects. By storing source data files, scripts, and models on open platforms, data scientists enable reproducibility of their research and allow others to find issues and offer improvements.

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

I think “good” insights are those that are both “relevant” and “correct,” and those are the ones you want to shoot for. As I wrote in Q2, always have a baseline for comparison.

You can do this either by experimenting, where you actually run a controlled test between different options and determine empirically which is the preferred outcome (like when A/B testing or using a Multi-armed Bandit algorithm to determine optimal features on a website), or by comparing predictive models to the current ground truth or projected outcomes from current data.

Also, solicit feedback about your results early and often by showing your customers, clients, and domain experts. Gather as much feedback as you can throughout the process in order to iterate on the models.
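As one concrete instance of the experimentation approach he mentions, here is a compact epsilon-greedy multi-armed bandit sketch. The arm reward rates are invented; a real test would use an actual business metric.

```python
import random

def epsilon_greedy(reward_probs, trials=2000, epsilon=0.1, seed=42):
    """Explore randomly with probability epsilon; otherwise exploit
    the arm with the best observed mean reward so far."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    rewards = [0.0] * len(reward_probs)
    for _ in range(trials):
        if rng.random() < epsilon:
            arm = rng.randrange(len(reward_probs))          # explore
        else:
            means = [rewards[i] / counts[i] if counts[i] else 0.0
                     for i in range(len(reward_probs))]
            arm = means.index(max(means))                   # exploit
        counts[arm] += 1
        # Simulated Bernoulli reward; stands in for a real conversion event.
        rewards[arm] += 1.0 if rng.random() < reward_probs[arm] else 0.0
    return counts

counts = epsilon_greedy([0.1, 0.8])  # arm 1 is clearly better
```

After enough trials, pull counts concentrate on the better arm, which is exactly the "determine empirically which is the preferred outcome" behavior described above.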

Anya Rumyantseva: http://www.odbms.org/2017/03/qa-with-data-scientists-anya-rumyantseva/

Q. How do you ensure data quality?

Quality of data has a significant effect on the results and efficiency of machine learning algorithms. Data quality management can involve checking for outliers/inconsistencies, fixing missing values, making sure the data in columns are within a reasonable range, making sure the data is accurate, etc. All of this can be done during the data pre-processing and exploratory analysis stages.

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

I would suggest constantly communicating with the other people involved in a project. They can relate insights from data analytics to defined business metrics. For instance, if a developed data science solution decreases the shutdown time of a factory from 5% to 4.5%, this is not that exciting for a mathematician. But for the factory owner it can be the difference between going bankrupt or not!

Dirk Tassilo Hettich: http://www.odbms.org/2017/03/qa-with-data-scientists-dirk-tassilo/

Q. How do you ensure data quality?

Understanding the data at hand by visual inspection. Ideally, browse through the raw data manually, since our brain is a super powerful outlier-detection apparatus. Do not try to check every value; just get an idea of how the raw data actually looks! Then look at the basic statistical moments (e.g., numbers and boxplots) to get a feeling for the data.

Once patterns are identified, parsers can be derived that apply certain rules to incoming data in a productive system.

Q. How do you know when the data sets you are analyzing are “large enough” to be significant?

Very important! I understand the question like this: how do you know that you have enough samples? There is not a single formula for this; in classification, it heavily depends on the amount and distribution of the classes you are trying to classify. Coming from a performance-analysis point of view, one should ask how many samples are required in order to successfully perform n-fold cross-validation. There is also extensive work on permutation testing of machine learning performance results. Of course, Cohen’s d for effect size and/or p-statistics provide a framework for such assessment.
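The n-fold cross-validation split he refers to can be sketched in a few lines of pure Python; every sample lands in exactly one test fold.

```python
def kfold_indices(n_samples, n_folds):
    """Return (train, test) index pairs for n-fold cross-validation."""
    indices = list(range(n_samples))
    folds = []
    for k in range(n_folds):
        test = indices[k::n_folds]          # every n_folds-th index
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        folds.append((train, test))
    return folds

folds = kfold_indices(10, 5)
```

If a class is so rare that some test folds contain no examples of it, that is itself a signal the data set may not be large enough, which is the point made above about class distribution.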

Not to make too much of an advertisement, but I wrote about exactly this in Section 2.5 of an article.

Wolfgang Steitz: 

Q. How do you ensure data quality? 

It’s good practice to start with some exploratory data analysis before jumping to the modeling part. Doing some histograms and some time series plots is often enough to get a feeling for the data and learn about potential gaps in the data, missing values, data ranges, etc. In addition, you should know where the data is coming from and what transformations it went through. Once you know all this, you can start filling the gaps and cleaning your data. Perhaps there is even another data set you want to take into account. For a model running in production, it’s a good idea to automate some data quality checks. These tests could be as simple as checking whether the values are in the correct range or whether there are any unexpected missing values. And of course, someone should be automatically notified if things go wrong.

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain? 

Presenting results to domain experts and your customers usually helps. Try to get feedback early in the process to make sure you are working in the right direction and the results are relevant and actionable. Even better, collect expectations first, so you know how your work will be evaluated later on.

Paolo Giudici: http://www.odbms.org/2017/03/qa-with-data-scientists-paolo-giudici/

Q. How do you ensure data quality?

For unsupervised problems: checking the contribution of the selected data to between-groups heterogeneity and within-groups homogeneity. For supervised problems: checking the predictive performance of the selected data.
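The between-groups/within-groups check for unsupervised problems can be made concrete with a small variance decomposition (a sketch with invented numbers; well-separated groups show between-group variance dominating within-group variance):

```python
def variance_decomposition(groups):
    """Split total sum of squares into within-group and between-group
    parts -- the quantities the answer above suggests checking."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    return within, between

# Two tight, well-separated groups: between >> within.
print(variance_decomposition([[1, 2, 3], [10, 11, 12]]))
```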

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

By testing its out-of-sample predictive performance we can check if it is correct. To check its relevance, the insights must be matched with domain knowledge models or consolidated results.

Q. What are the typical mistakes done when analyzing data for a large scale data project? Can they be avoided in practice?

Forgetting data quality and exploratory data analysis, and rushing to the application of complex models. Forgetting that pre-processing is a key step, and that benchmarking the model against simpler ones is always a necessary prerequisite.

Q. How do you know when the data sets you are analyzing are “large enough” to be significant?

When estimations and/or predictions become quite stable under data and/or model variations.

Andrei Lopatenko: http://www.odbms.org/2017/03/qa-with-data-scientists-andrei-lopatenko/

Q. How do you ensure data quality?

Assuming data quality is not enough; it must be automatically checked. In real-world applications it rarely happens that you get data only once. Frequently you get a stream of data. If you build an application about local businesses, you get a stream of data from providers of business data. If you build an e-commerce site, you get regular data updates from merchants and other data providers. The problem is that you can almost never be sure of data quality. In most cases the data are dirty.

You have to protect your customers from dirty data. You have to work to discover what problems with the data you might have. Frequently the problems are not trivial. Sometimes you can see them by browsing the data directly; frequently you cannot.

For example, in the case of local businesses, latitude/longitude coordinates might be wrong because the provider has a bad geocoding system. Sometimes you do not see problems with the data immediately, but only after using it to train some model, where errors accumulate and lead to wrong results, and you have to trace back what went wrong.

To ensure data quality, once I understand what problems may happen, I build data quality monitoring software. At every step of the data processing pipeline I embed tests that check the quality of the data; you may compare them with unit tests in traditional software development. They may check the total amount of data, the existence or non-existence of certain values, or anomalies in the data, compare the data to the previous batch, and so on. It requires significant effort to build data quality tests, but it pays back: they protect against errors in data engineering, data science, incoming data, and some system failures. It always pays back.
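A couple of the pipeline tests described above, written in the spirit of unit tests for data (a sketch; the record layout, field names, and 50% volume tolerance are invented for illustration):

```python
def batch_checks(batch, previous_count, required_fields, tolerance=0.5):
    """Run simple data-quality tests on one pipeline batch:
    volume anomaly versus the previous batch, plus existence
    checks for required values. Returns a list of failures."""
    failures = []
    if previous_count and abs(len(batch) - previous_count) / previous_count > tolerance:
        failures.append("volume anomaly vs. previous batch")
    for i, record in enumerate(batch):
        for field in required_fields:
            if record.get(field) in (None, ""):
                failures.append(f"record {i}: missing {field}")
    return failures

batch = [{"name": "Cafe A", "lat": 47.4}, {"name": "", "lat": None}]
print(batch_checks(batch, previous_count=2, required_fields=["name", "lat"]))
```

Wired into each pipeline stage, a non-empty failure list is what raises the alert before dirty data reaches downstream models.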

From my experience, almost every company builds a set of libraries and similar code to ensure data quality control. We did it at Google, we did it at Apple, we did it at Walmart.

At the Recruit Institute of Technology we work on the Big Gorilla tool set, which will include our open-source software and references to other open-source software that may help companies build data quality pipelines.

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

Most frequently, companies have some important metrics that describe the company’s business. It might be the average revenue per session, the conversion rate, the precision of the search engine, etc. And your data insights are as good as their improvement of these metrics. Assume that in an e-commerce company the main metric is average revenue per session (ARPS), and you work on a project to improve the extraction of a certain item attribute, for example from unstructured text.

The question to ask yourself is: will it help to improve ARPS, by improving search (because it will increase relevance for queries with color intent or faceted queries by color), by providing better snippets, or by still other means? Sometimes one metric does not describe the company’s business and many numbers are needed to understand it; your data projects might then be connected to other metrics. What’s important is to connect your data insight project to metrics that are representative of the company’s business, so that improving these metrics has a significant impact on the business. Such a connection makes a good project.

Q. What are the typical mistakes done when analyzing data for a large scale data project? Can they be avoided in practice?

Typical mistake – assuming that data are clean. Data quality should be examined and checked.

Mike Shumpert: http://www.odbms.org/2017/03/qa-with-data-scientists-mike-shumpert/

Q. How do you ensure data quality?

On the one hand, one of the basic tenets of “big data” is that you can’t ensure data quality – today’s data is voluminous and messy, and you’d better be prepared to deal with it. As mentioned before, “dealing with it” can simply mean throwing some instances out, but sometimes what you think is an outlier could be the most important information you have.

So if you want to enforce at least some data quality, what can you do? It’s useful to think of data as comprising two main types: transactional or reference. Transactional data is time-based and constantly changing – it typically conveys that something just happened (e.g., customer checkouts), although it can also be continuous data sampled at regular intervals (e.g., sensor data). Reference data changes very slowly and can be thought of as the properties of the object (customer, machine, etc.) at the center of the prediction.

Both types of data typically have predictive value: this amount at this location was just spent (transactional) by a platinum-level female customer (reference) – is it fraud? But the two types often come from different sources and can be treated differently in terms of data quality.

Transactional data can be filtered or smoothed to remove transitory outliers, but the problem domain will determine whether or not any such anomalies are noise or real (and thus very important). For example, the $10,000 purchase on a credit card with a typical maximum of $500 is one that deserves further scrutiny, not dismissal.

But reference data can be separately cleansed and maintained via Master Data Management (MDM) technology. This ensures there is only one version of the truth with respect to the object at the core of the prediction and prevents nonsensical changes, such as a customer moving from gold status to platinum and back again within 30 seconds. Clean reference data can then be merged with transactional data on the fly to ensure accurate predictions.

Using an Internet of Things (IoT) example, consider a predictive model for determining when a machine needs to be serviced. The model will want to leverage all the sensor data available, but it will also likely find useful factors such as the machine type, date of last service, country of origin, etc. The data stream coming from the sensors usually will not carry with it the reference data and will probably only provide a sensor id. That id can be used to look up relevant machine data and enrich the data stream on the fly with all the features needed for the prediction.
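The sensor-id lookup described above amounts to an in-memory join of the event stream against cached reference data. A minimal sketch (the sensor id, reference fields, and `enrich` helper are invented for illustration):

```python
# Reference data, cleansed once and cached in memory, keyed by sensor id.
MACHINE_REFERENCE = {
    "sensor-42": {"machine_type": "pump", "country_of_origin": "DE"},
}

def enrich(event, reference):
    """Join a raw sensor event with cached reference attributes so the
    model sees all its features; unknown ids are passed through with
    a flag rather than silently dropped."""
    ref = reference.get(event["sensor_id"])
    if ref is None:
        return {**event, "reference_missing": True}
    return {**event, **ref}

print(enrich({"sensor_id": "sensor-42", "temp": 81.5}, MACHINE_REFERENCE))
```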

One final point on this setup is that you do not want to go back to the original data sources of record for this on-the-fly enrichment of transactional data with reference data.

You want the cleansed data from the MDM system, and you want that stored in memory for high-performance retrieval.

Romeo Kienzler: http://www.odbms.org/2017/03/qa-with-data-scientists-romeo-kienzler/

Q. How do you ensure data quality?

This is again a vote for domain knowledge. I have someone with domain skills assess each data source manually. In addition I gather statistics on the accepted data sets so some significant changes will raise an alert which – again – has to be validated by a domain expert.

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

I’m using the classical statistical performance measures to assess the performance of a model. This is only about the mathematical properties of the model. Then I check with the domain experts on the significance for their problems. Often a statistically significant result is not relevant for the business. E.g., telling you that a bearing will break with 95% probability within the next 6 months might not really help the PMQ (Predictive Maintenance and Quality) guys. So the former can be described as “correct” or “good”, whereas the latter, maybe, as “relevant”.

Elena Simperl: http://www.odbms.org/2017/02/qa-with-data-scientists-elena-simperl/

Q. How do you ensure data quality?

It is not possible to “ensure” data quality, because you cannot say for sure that there isn’t something wrong with it somewhere. In addition, there is also some research which suggests that compiled data are inherently filled with the (unintentional) bias of the people compiling it. You can attempt to minimise the problems with quality by ensuring that there is full provenance as to the source of the data, and err on the side of caution where some part of it is unclassified or possibly erroneous.

One of the things we are researching at the moment is how best to leverage the wisdom of the crowd to ensure data quality, known as crowdsourcing. The existence of tools such as Crowdflower makes it easy to organise a crowdsourcing project, and we have had some level of success in image understanding, social media analysis, and Web data integration. However, the best ways of optimising cost, accuracy or time remain to be determined, and they differ depending on the particular problem or the motivation of the crowd one works with.

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

This question links back to a couple of earlier questions nicely. The importance of having good enough domain knowledge comes into play in terms of answering the relevance question. Hopefully a data scientist will have a good knowledge of the domain, but if not then they need to be able to understand what the domain expert believes in terms of relevance to the domain.

The correctness or value of the data then comes down to understanding how to evaluate machine learning algorithms in general, and using domain knowledge to apply to decide whether the trade-offs are appropriate given the domain.

Mohammed Guller: http://www.odbms.org/2017/02/qa-with-data-scientists-mohammed-guller/

Q. How do you ensure data quality?

It is a tough problem. Data quality issues generally occur upstream in the data pipeline. Sometimes the data sources are within the same organization, and sometimes data comes from a third-party application. It is relatively easy to fix data quality issues if the source system is within the same organization; even then, the source may be a legacy application that nobody wants to touch.

So you have to assume that data will not be clean and address the data quality issues in your application that processes data. Data scientists use various techniques to address these issues. Again, domain knowledge helps.

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

This is where domain knowledge helps. In the absence of domain knowledge, it is difficult to verify whether the insight obtained from data analytics is correct. A data scientist should be able to explain the insights obtained from data analytics. If you cannot explain it, chances are that it may be just a coincidence. There is an old saying in machine learning, “if you torture data sufficiently, it will confess to almost anything.”

Another way to evaluate your results is to compare them with the results obtained using a different technique. For example, you can do backtesting on historical data. Alternatively, compare your results with those obtained using the incumbent technique. It is good to have a baseline against which you can benchmark results obtained using a new technique.

Natalino Busa: 

Q. How do you ensure data quality?

I tend to rely on the “wisdom of the crowd” by implementing similar analyses using multiple techniques and machine learning algorithms. When the results diverge, I compare the methods to gain insight into the quality of both the data and the models. This technique also works well for validating the quality of streaming analytics: in this case the batch historical data can be used to double-check the results in streaming mode, providing, for instance, end-of-day or end-of-month reporting for data correction and reconciliation.
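The divergence check between independently built models can be reduced to a single number worth monitoring (a sketch; the example predictions are invented):

```python
def divergence_rate(preds_a, preds_b):
    """Fraction of cases where two independently built models disagree;
    a high rate is a prompt to inspect both the data and the models."""
    disagreements = sum(a != b for a, b in zip(preds_a, preds_b))
    return disagreements / len(preds_a)

# Two models that disagree on 1 of 5 cases:
print(divergence_rate([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))
```

The same comparison works across batch and streaming outputs: run it at end of day and alert when the rate drifts upward.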

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain? 

Most of the time I interact with domain experts for a first review of the results. Subsequently, I make sure that the model is brought into “action”. Relevant insights, in my opinion, can always be assessed by measuring their positive impact on the overall application. Most of the time, as human interaction is part of the loop, the easiest method is to measure the impact of the relevant insights on the users’ digital journey.

Vikas Rathee: 

Q. How do you ensure data quality?

Data quality is very important to make sure the analysis is correct and any predictive model we develop using that data is good. Very simply, I would do some statistical analysis on the data, create some charts, and visualize the information. I would also clean the data by making some choices at data preparation time. This would be part of the feature engineering stage that needs to happen before any modeling can be done.

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

Getting Insights is what makes the job of a Data Scientist interesting. In order to make sure the insights are good and relevant we need to continuously ask ourselves what is the problem we are trying to solve and how it will be used.

In simpler words, to make improvements in an existing process we need to understand the process and where improvement is required or of most value. For predictive modeling cases, we need to ask how the output of the predictive model will be applied and what additional business value can be derived from it. We also need to convey what the predictive model output means, to avoid incorrect interpretation by non-experts.

Once the context around a problem has been defined, we proceed to implement the machine learning solution. The immediate next stage is to verify whether the solution will actually work.

There are many techniques to measure the accuracy of predictions, e.g. testing with historical data samples using techniques like k-fold cross-validation, the confusion matrix, r-squared, absolute error, MAPE (mean absolute percentage error), p-values, etc. We can choose from among many models the ones that show the most promising results. There are also ensemble algorithms, which generalize the learning and avoid overfit models.
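The k-fold cross-validation mentioned above can be sketched from scratch in a few lines (a minimal, standard-library illustration with contiguous, unshuffled folds and an invented majority-class baseline; real work would use a library implementation with shuffling and stratification):

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(X, y, fit, predict, k=5):
    """Average accuracy across k train/test splits."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k

# Majority-class baseline -- the kind of simple benchmark worth beating:
fit = lambda X, y: Counter(y).most_common(1)[0][0]
predict = lambda model, x: model

X, y = list(range(10)), [1] * 8 + [0] * 2
print(cross_val_accuracy(X, y, fit, predict, k=5))
```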

Christopher Schommer: http://www.odbms.org/2017/01/qa-with-data-scientists-christopher-schommer/

Q. How do you ensure data quality?

Maintaining data quality is mostly an adaptive process, for example because provisions of national law may change, or because the analytical aims and purposes of the data owner may vary. Therefore, ensuring data quality should be performed regularly, should be consistent with the law (data privacy aspects and others), and should typically be performed by a team of experts with different backgrounds (e.g., data engineers, lawyers, computer scientists, mathematicians).

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

In my understanding, an insight is already a valuable, evaluated piece of information, which has been obtained after a detailed interpretation and which can be used for any kind of follow-up activity, for example to relocate merchandise or to dig deeper into clusters showing fraudulent behavior.

However, it is less opportune to rely only on statistical values: an association rule that shows a conditional probability of, e.g., 90% or more may be an “insight”, but if the right-hand side of the rule refers only to a plastic bag (which costs 3 cents, at least in Luxembourg), the discovered pattern might be uninteresting.

Slava Akmaev: http://www.odbms.org/2017/01/qa-with-data-scientists-slava-akmaev/

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain? 

In a data-rich domain, evaluation of an insight’s correctness is done either by applying the mathematical model to new, “unseen” data or by using cross-validation. This process is more complicated in human biology. As we have learned over the years, a promising cross-validation performance may not be reproducible in subsequent experimental data. The fact of the matter is that in the life sciences, laboratory validation of computational insight is mandatory. The community perspective on computational or statistical discovery is generally skeptical until the novel analyte, therapeutic target, or biomarker is validated in additional confirmatory laboratory experiments, pre-clinical trials, or human fluid samples.

Jochen Leidner: http://www.odbms.org/2017/01/qa-with-data-scientists-jochen-leidner/

Q. How do you ensure data quality?

There are a couple of things: first, make sure you know where the data comes from and what the records actually mean.

Is it a static snapshot that was already processed in some way, or does it come from the primary source? Plotting histograms and profiling the data in other ways is a good start for finding outliers and data gaps that should undergo imputation (the filling of data gaps with reasonable fillers). Measuring is key, so doing everything from inter-annotator agreement on the gold data, through training, dev-test and test evaluations, to human SME output grading consistently pays back the effort.
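The imputation step mentioned above can be as simple as replacing gaps with the median of the observed values (a minimal sketch with invented data; median is one reasonable filler among several):

```python
import statistics

def impute_median(values):
    """Fill gaps (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

print(impute_median([3, None, 5, 1, None]))  # gaps become the median, 3
```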

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

There is nothing quite as good as asking domain experts to vet samples of the output of a system. While this is time consuming and needs preparation (to make their input actionable), the closer the expert is to the real end user of the system (e.g. the customer’s employees using it day to day), the better.

Claudia Perlich: http://www.odbms.org/2016/11/qa-with-data-scientists-claudia-perlich/

Q. How do you ensure data quality?

The sad truth is – you cannot. Much is written about data quality and it is certainly a useful relative concept, but as an absolute goal it will remain an unachievable ideal (with the irrelevant exception of simulated data …).

First off, data quality has many dimensions.

Secondly, it is inherently relative: the exact same data can be quite good for one purpose and terrible for another.

Third, data quality is a very different concept for ‘raw’ event log data vs. aggregated and processed data.

Finally, and this is by far the hardest part: you almost never know what you don’t know about your data.

In the end, all you can do is your best! Scepticism, experience, and some sense of data intuition are the best sources of guidance you will have.

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

First off, one should not even have to ask whether the insight is relevant – one should have designed the analysis that led to the insight around the relevant practical problem one is trying to solve! The answer might be that there is nothing better you can do than the status quo. That is still a highly relevant insight! It means that you will NOT have to waste a lot of resources. Take a negative answer into account as ‘relevant’: if you are running into the issue of the results of data science not being relevant, you are clearly not managing data science correctly. I have commented on this here: What are the greatest inefficiencies data scientists face today?

Let’s look at ‘correct’ next. What exactly does it mean? To me it somewhat narrowly means that it is ‘true’ given the data: did you do all the due diligence and right methodology to derive something from the data you had? Would somebody answering the same question on the same data come to the same conclusion (replicability)? You did not overfit, you did not pick up a spurious result that is statistically not valid, etc. Of course you cannot tell this from looking at the insight itself. You need to evaluate the entire process (or trust the person who did the analysis) to make a judgement on the reliability of the insight.

Now to the ‘good’. To me good captures the leap from a ‘correct’ insight on the analyzed dataset to supporting the action ultimately desired. We do not just find insights in data for the sake of it! (well – many data scientists do, but that is a different conversation). Insights more often than not drive decisions. A good insight indeed generalizes beyond the (historical) data into the future. Lack of generalization is not just a matter of overfitting, it is also a matter of good judgement whether there is enough temporal stability in the process to hope that what I found yesterday is still correct tomorrow and maybe next week. Likewise we often have to make judgement calls when the data we really needed for the insight is simply not available. So we look at a related dataset (this is called transfer learning) and hope that it is similar enough for the generalization to carry over. There is no test for it! Just your gut and experience …

Finally, good also incorporates the notion of correlation vs. causation. Many correlations are ‘correct’ but few of them are good for the action one is able to make. The (correct) fact that a person who is sick has temperature is ‘good’ for diagnosis, but NOT good for prevention of infection. At which point we are pretty much back to relevant! So think first about the problem and do good work next!

Ritesh Ramesh: http://www.odbms.org/2016/11/qa-with-data-scientists-ritesh-ramesh/

Q. How do you ensure data quality?

Data Quality is critical. We hear often from many of our clients that ensuring trust in the quality of information used for analysis is a priority. The thresholds and tolerance of data quality can vary across problem domains and industries but nevertheless data quality and validation processes should be tightly integrated into the data preparation steps.

Data scientists should have full transparency on the profile and quality of the datasets they are working with, and should have tools at their disposal to remediate issues with proper governance and procedures as necessary. Emerging data quality technologies are leveraging machine learning to proactively detect data errors, making data quality a more business-user-friendly and intelligent function than it has ever been.

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

Many people view analytics and data science as some magic crystal ball into future events and don’t realize that it is just one of many probable indicators of successful outcomes. If the model predicts that there’s an 80% chance of success, you also need to read it as: there’s still a 20% chance of failure. To really assess the ‘quality’ of the insights from the model, you may start with the areas below:

1) Assess whether the model makes reasonable assumptions about the problem domain and takes into account all the relevant input variables and business context. I was recently reading an article about a U.S.-based insurer that implemented an analytics model that looked at the number of unfavorable traffic incidents to assess the risk of a vehicle driver, but missed assigning weights to the severity of the incidents. If your model makes wrong contextual assumptions, the outcomes can backfire.

2) Assess whether the model is run on a sufficient sample of datasets. Modern scalable technologies have made executing analytical models on massive amounts of data possible.

The more data the better, although not every problem needs large datasets of the same kind.

3) Assess whether extraneous events, like macroeconomic events, weather, consumer trends, etc., are considered in the model constraints. Use of external data sets with real-time, API-based integrations is highly encouraged, since it adds more context to the model.

4) Assess the quality of the data used as input to the model. Feeding wrong data to a good analytics model and expecting it to produce the expected outcomes is unreasonable. The stakes are higher in highly regulated environments, where a minimal error in the model might mean millions of dollars in lost revenues or penalties.

Even successful organizations that execute seamlessly in generating insights struggle to “close the loop” in translating those insights into the field to drive shareholder value.

It’s always good practice to pilot the model on a small population, link its insights and actions to key operational and financial metrics, measure the outcomes, and then decide whether to improve or discontinue the model.

Richard J Self: http://www.odbms.org/2016/11/qa-with-data-scientists-richard-j-self/

Q. How do you ensure data quality? 

Data Quality is a fascinating question. It is possible to invest enormous levels of resource into attempting to ensure near perfect data quality and still fail.

The critical question should, however, start from the Governance perspective of questions such as:

  1. What is the overall business Value of the intended analysis?
  2. How is the Value of the intended insight affected by different levels of data quality (or Veracity)?
  3. What is the level of Vulnerability to our organisation (or other stakeholders) if the data is not perfectly correct (see J Easton of IBM comment above) in terms of reputation, or financial consequences?

Once you have answers to those questions and the sensitivities of your project to various levels of data quality, you will then begin to have an idea of just what level of data quality you need to achieve. You will also then have some ideas about what metrics you need to develop and collect, in order to guide your data ingestion and data cleansing and filtering activities.

Q. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

The answer to this returns to the Domain Expert question. If you do not have adequate domain expertise in your team, this will be very difficult.

Referring back to the U.S. election: one of the more unofficial pollsters, who got it pretty much right, observed that he did so because he actually talked to real people. This is domain expertise and Small Data.

All the official polling organisations have developed a total trust in Big Data and analytics, because it can massively reduce the costs of the exercise. But they forget that we all lie unremittingly online. See the first of the “All Watched Over by Machines of Loving Grace” documentaries at https://vimeo.com/groups/96331/videos/80799353 to get a flavour of this unreasonable trust in machines and big data.


Percona Live Open Source Database Conference Europe 2017 — Q&A with Peter Zaitsev, Co-founder and CEO, Percona
http://www.odbms.org/2017/09/percona-live-open-source-database-conference-europe-2017-qa-with-peter-zaitsev-co-founder-and-ceo-percona/
Sat, 16 Sep 2017 04:12:52 +0000

Q1. Percona has been producing the Percona Live conferences since 2011. How has the conference evolved over the years?

The most important evolutions have been in location and scope. Originally, the conference was just in the U.S. and only focused on MySQL. From there, we launched a second conference in Europe and expanded the scope to include NoSQL solutions, including MongoDB, and other open source database solutions. Each move has been made to match the evolution of the market and the introduction of new and interesting technologies. We have been pleased that each change has been met with an increase in participation and enthusiasm from the open source database community. The knowledge sharing and the ability to meet regularly with our colleagues from around the world has been very rewarding.

Q2. Where is this year’s European conference taking place?

The 2017 Percona Live Europe Open Source Database Conference is taking place September 25-27, 2017 at the Radisson Blu Royal Hotel in Dublin, Ireland. Tickets are still available and can be purchased online at https://www.percona.com/live/e17/registration-information.

 Q3. Why the move to Dublin?

We started the European conference series in 2011 in London. While London was a great place for our conference, we wanted to acknowledge how rich and diverse the European market is, so we decided to move the conference location every few years to give more people a chance to attend. Amsterdam was also a fantastic city for the conference. With Dublin serving as a tech hub and the European HQ for many companies, it offers us a new and very exciting place to meet.

Q4. What is the main theme of Percona Live Europe 2017?

The theme for this year’s conference is “Championing Open Source Databases.” We have a great program lined-up with tutorials and sessions focusing on MySQL, MariaDB, MongoDB and other open source database technologies, including time series databases, PostgreSQL and RocksDB.

Q5. Who typically attends the conferences?

The Percona Live Open Source Database Conference series draws from the amazingly diverse open source community. Aside from users and businesses that develop open source database software, we see many enterprise attendees that are exploring the move to open source databases and are interested in learning from the ecosystem. We are really proud that the conference series attracts titans of industry including, Booking.com, Facebook, Google, Intel, Microsoft, Oracle, Slack, VMWare and more.

Q6. Who is going to speak at this year’s conference?

It’s hard to provide just a few names because there are dozens and dozens of top technology experts speaking at the event. Just our keynote line-up includes Rene Cannao from ProxySQL, Tom Arnfeld from Cloudflare, Shlomi Noach from GitHub, Yoshinori Matsunobu from Facebook, Brian Brazil from Robust Perception, Geir Høydalsvik from Oracle, Laine Campbell from OpsArtisan, Charity Majors from Honeycomb, and Peter Zaitsev and Michael Coburn from Percona.

Q7. What technical aspects will you cover in the conference, with respect to databases such as MySQL, MariaDB, MongoDB, PostgreSQL and other open source database technologies?

Attendees can find all of these technologies thoroughly covered in multiple talks, many of which are hands-on tutorials on how to set up and deploy them. Our speakers will tackle subjects such as analytics, architecture and design, security, operations, scalability and performance. Percona Live Europe provides in-depth discussions of high availability, IoT, cloud, big data and other evolving business needs.

Q8. What technical aspects will you cover in the conference, with respect to time series databases and RocksDB?

Time series databases and RocksDB are both focus areas at the conference. For time series databases, we’ll have experts discussing how to use Prometheus, InfluxDB, PostgreSQL and other database technologies to build and run time series environments. We’ll also have several talks on how to monitor and visualize time series data using tools like Grafana and Percona Monitoring and Management.

For RocksDB, we’ll have the Facebook engineers who actually develop the software in attendance, discussing how to deploy it and how to tune its internals to maximize performance. Many of our speakers use RocksDB in their production environments and will be presenting the ins and outs of how those deployments operate.

Q9. What are you looking forward to the most for this year’s conference?

Of course, knowledge sharing is the most important reason for the conference and I always look forward to seeing what is new and interesting in the open source database world. I also look forward to seeing familiar faces, meeting new people, and having the opportunity to interact with colleagues in an atmosphere that encourages creative and visionary thinking.

Q10. Is there anything else you would like to add?

The Seventh Annual Percona Live Open Source Database Conference will take place April 23-25, 2018 at The Hyatt Regency Santa Clara and Santa Clara Convention Center.

Sponsored by Percona

Open Source Forum, November 15, 2017, Yokohama, Japan
http://www.odbms.org/2017/09/open-source-forum-november-15-2017-yokohama-japan-2/ (Fri, 15 Sep 2017)

Open Source Forum is an invitation-only event (an invitation request is required; the deadline is Nov. 10, 23:59 JST) that will be held in Japan annually. The event is designed to advance the open source industry in Japan by bringing the hottest open source technology topics and people together to collaborate.

TKP Garden City Yokohama, Yokohama, Japan

LINK: http://events.linuxfoundation.org/events/open-source-forum


CityGML change detection, Dependency Analysis, California Road Networks
http://www.odbms.org/2017/09/citygml-change-detection-dependency-analysis-california-road-networks/ (Fri, 15 Sep 2017)

By Mark Needham at Neo4j

Giannatou wrote a report Graph data mining with Neo4j (PDF) in which she shows how to import a dataset containing California’s road networks and points of interest and then write Cypher queries against it. The source code for the project is also available on GitHub.
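To give a flavor of the kind of analysis such queries enable, here is a minimal pure-Python sketch of a shortest-route lookup over a toy road network (the node names and distances are invented for illustration; they are not from the California dataset, and a real deployment would express this as a Cypher query against Neo4j):

```python
import heapq

# Toy road network as an adjacency list of (neighbor, distance_km) pairs.
# Node names and distances are illustrative only.
roads = {
    "A": [("B", 4.0), ("C", 2.0)],
    "B": [("A", 4.0), ("D", 5.0)],
    "C": [("A", 2.0), ("D", 8.0)],
    "D": [("B", 5.0), ("C", 8.0)],
}

def shortest_distance(graph, start, goal):
    """Dijkstra's algorithm: total length of the shortest path, or None."""
    dist = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, already relaxed via a shorter path
        for neighbor, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(queue, (nd, neighbor))
    return None

print(shortest_distance(roads, "A", "D"))  # A -> B -> D = 9.0
```

In Neo4j itself this same question would typically be asked with a `shortestPath` pattern or a graph algorithm procedure rather than hand-rolled Dijkstra; the sketch just shows the shape of the computation.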

A really cool project I came across is citygml-change-detection by Son Nguyen from the Department of Civil, Geo and Environmental Engineering at the Technical University of Munich. This tool can be used to detect spatio-semantic changes between two arbitrarily large CityGML datasets using Neo4j.

CDO Summit, London, England: November 29, 2017
http://www.odbms.org/2017/09/cdo-summit-london-england-november-29-2017/ (Fri, 15 Sep 2017)
The data analytics solution ready for MiFID II
http://www.odbms.org/2017/09/the-data-analytics-solution-ready-for-mifid-ii/ (Thu, 14 Sep 2017)

Under MiFID II, financial institutions will need to reach higher data standards.

The data as a service (DaaS) model is increasingly gaining ground among firms seeking analytics solutions to deal with MiFID II requirements on real-time and historical data.

For the first time, financial institutions involved in fixed income, foreign exchange, currency derivatives and commodity derivatives will soon be required to meet the same data standards MiFID I has imposed on the equity markets.

Discover more about how Thomson Reuters has the expertise to ensure you meet your MiFID II obligations

As a result of MiFID II, they will need solutions that allow them to capture and analyze transaction-related data throughout the entire lifecycle of a trade.

This presents a challenge for those broker-dealers who only infrequently execute bond trades, or for investment managers seeking to model transaction cost analysis (TCA) to ascertain whether dealers are quoting the best price.
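At its simplest, the TCA question above amounts to comparing each execution price against the best quoted price at trade time and expressing the difference in basis points. A minimal sketch, with invented trade records and field names (not a real market-data format):

```python
# Minimal transaction cost analysis (TCA) sketch: slippage of each
# execution versus the best quoted price, in basis points.
# Trade records and field names are illustrative only.
trades = [
    {"bond": "XS0001", "side": "buy",  "exec_price": 100.12, "best_quote": 100.10},
    {"bond": "XS0002", "side": "sell", "exec_price": 99.85,  "best_quote": 99.90},
]

def slippage_bps(trade):
    """Positive result means the execution was worse than the best quote."""
    diff = trade["exec_price"] - trade["best_quote"]
    if trade["side"] == "sell":
        diff = -diff  # for a sell, executing below the best quote is worse
    return 10_000 * diff / trade["best_quote"]

for t in trades:
    print(t["bond"], round(slippage_bps(t), 2))
```

A production TCA model would of course work from time-stamped quote history rather than a single reference quote, which is exactly the kind of time-series lookup the platforms discussed below are built for.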

Solving these MiFID II challenges requires new technologies and the ability to scale them according to a firm’s individual needs, creating a large, untapped opportunity in the now-disrupted over-the-counter markets.

Entire trade lifecycle

With data-as-a-service, organizations are able to build centralized platforms to hold and analyze all the data they need, cutting down costs associated with moving data around.

This helps them increase transparency, because with one tool they can capture and analyze transaction-related data throughout the entire lifecycle of a trade.

It also helps firms improve returns by removing the need to clean and normalize financial data multiple times across the enterprise. Once a centralized, clean dataset is available, all operational groups can use it.


Best execution compliance

Thomson Reuters is partnering with Kx, a division of First Derivatives, to bring its DaaS offering to the latest version of Velocity Analytics, VA8.

The platform features extraordinary new functionality built on the foundation of our financial and risk content, combined with Kx’s robust computing and analytical software.

Thomson Reuters Velocity Analytics

Key enhancements provide ultra-high-speed processing of real-time, streaming and historical data to help EU and non-EU financial firms of all sizes meet their MiFID II obligations.

Process much larger volumes of data from multiple sources in real-time with Thomson Reuters Velocity Analytics 

VA8 enables a broad range of use cases such as best execution compliance, transaction cost analysis, quantitative and systematic trading.

It will also support new multi-asset best execution and SI (Systematic Internaliser) determination capabilities from Thomson Reuters in 2018.

Product screenshot of Thomson Reuters Velocity Analytics

Data management benefits

The Kx DaaS platform, Kx Data Refinery, is a high-performance, low-latency data-processing tool that provides flexible real-time access to time-series data and powerful analytics.

It’s designed to ease the data management and processing burden so users can focus on leveraging the data itself.

Listen — Data as a Service: Realizing its Value for Data Management

It provides a complete set of tools for managing data, from ingestion through consumption by multiple parties, in a consistent, controlled manner.

Kx Data Refinery can handle the full range of OTC and exchange traded instruments with different volumes and velocities.

‘Gold standard’ analytics

Our decision to offer a data-as-a-service platform on VA8 stems from the decades of experience Kx has building large, complex trading systems at banks, hedge funds, exchanges, and regulatory bodies, many of which have long used its kdb+ database.

Real-world benchmarking by STACresearch.com shows that Kx is the gold standard for market data analytics.

As with other Thomson Reuters partnerships, our venture with Kx creates something greater than the sum of its parts, providing our clients with the most straightforward, deep, and effective real-time system for market data analysis available.

Originally published here.
