“I’m intrigued by the general trend towards empowering individuals to share their data in a secure and controlled environment. Democratisation of data in this way has to be the future. Imagine what we will be able to do in decades to come, when individuals have access to their complete healthcare records in electronic form, paired with high quality data from genomics, epigenetics, microbiome, imaging, activity and lifestyle profiles, etc., supported by a platform that enables individuals to share all or parts of their data with partners of their choice, for purposes they care about, in return for services they value – very exciting!” —Bryn Roberts
I have interviewed Bryn Roberts, Global Head of Operations for Roche Pharmaceutical Research & Early Development, and Site Head in Basel. We talked about using AI and Data Analytics in Pharmaceutical Research.
Q1. What are your responsibilities as Global Head of Operations for Roche Pharmaceutical Research & Early Development, and Site Head in Basel?
Bryn Roberts: I have a broad range of responsibilities that center around creating and operating a highly innovative global R&D enterprise, Roche pRED, where excellent scientific decision making is optimised along with efficiency, effectiveness, sustainability and compliance. Informatics is my largest department and includes workflow platforms in discovery and early development, architecture, infrastructure and software development, data science and digital solutions.
Facilities, infrastructure and end-to-end lab services, including three new R&D center building projects, provide state-of-the-art innovation centers and labs that integrate the latest architectural concepts, instrumentation, automation, robotics and supply chain to facilitate cutting-edge science.
These more tangible assets are complemented by a number of business operations teams, who oversee quality, compliance, risk management and business continuity, information and knowledge management, research contracts, academic and industrial collaborations, change and transformation, procurement, safety-health-environment, etc.
As the Site Head for the Roche Innovation Center in Basel, I am, together with my local leadership team, accountable for the engagement and well-being of more than a thousand Research and Early Development colleagues at our headquarters site.
Our task is to create a vibrant environment that attracts, motivates and equips world-class talent. Initiatives range from scientific meetings, wellbeing programmes, workplace improvements, communication and knowledge sharing, celebrations and social events, to engagement of local academic and governmental organizations, sponsorship of local scientific conferences, and contribution to the overall Roche site development in Basel and Kaiseraugst.
Q2. Understanding a disease now requires integrated data and advanced analytics. What are the most common problems you encounter when integrating data from different sources, and how do you solve them?
Bryn Roberts: The challenges often relate to the topics represented by the FAIR acronym. These are Findability, Accessibility, Interoperability and Reusability. As with all organizations where large data assets have been generated and acquired over a long period of time, across many departments and projects, it is challenging to establish and maintain these FAIR Data principles. We have been committed to FAIR data for many years and continue to increase our investment in ‘FAIRification’, with particular emphasis currently on clinical trial and real-world data. Addressing the challenges requires a well thought through, and robustly implemented, information architecture incorporating data catalogues based on high quality meta-data, a holistic terminology service that enables semantic data integration, curation processes supporting data quality and annotation, appropriate application of data standards, etc.
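The terminology service mentioned above can be illustrated with a minimal sketch. This is a hypothetical example, not Roche's implementation, and the vocabulary is invented: free-text annotations from different sources are mapped to one canonical term, which is the basic move behind semantic data integration.

```python
# Minimal terminology-service sketch: map raw annotations from
# different sources onto a single canonical term so that datasets
# become semantically interoperable. Terms here are illustrative only.

CANONICAL_TERMS = {
    "non-small cell lung carcinoma": {
        "nsclc",
        "non-small cell lung cancer",
    },
}

def to_canonical(annotation):
    """Return the canonical term for a raw annotation, or None if unknown."""
    needle = annotation.strip().lower()
    for canonical, synonyms in CANONICAL_TERMS.items():
        if needle == canonical or needle in synonyms:
            return canonical
    return None
```

With such a mapping, records annotated “NSCLC” in one system and “non-small cell lung cancer” in another can be brought together under one term at query time.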
On the advanced analytics side, it is very helpful to establish mechanisms for sharing algorithms and analysis pipelines, such as code repositories, and for annotating derived data and insights. Applying the FAIR principles to algorithms and analysis pipelines, as well as datasets, is an excellent way of sharing knowledge and leveraging expertise in an organization.
We are currently implementing a ‘Data Commons’ architecture framework to facilitate data management, integration, ‘FAIRification’, and to enable analysts of different types to leverage fully the data, as well as insights and analyses from their colleagues. Frameworks like this are essential in a large R&D enterprise, utilising complex high-dimensional data (e.g. genomics, imaging, digital monitoring), requiring federation of data and/or analyses, robust single-point-of-truth or master data management, access control, etc. In this regard, we are in our second generation architecture for our platform supporting disease understanding. My colleague, Jan Kuentzer, presented an excellent overview at the PRISME Forum last year and the slides are available if people would like to learn more (Roche Data Commons).
Q3. How do you judge the quality of data, before and after you have done data integration?
Bryn Roberts: This question of data quality is even more complicated than it may first appear. Although there are some more elaborate models out there, conceptualising data quality in two broad perspectives may help.
Firstly, what we might call prescriptive quality, where we can test data against pre-determined standards such as vocabularies, ontologies, allowed values or ranges, counts, etc. This is an obvious step in data quality assessment and can be automated to a large degree, including in database schemas and constraints. A very challenging aspect of prescriptive quality is judging the upstream processes involved in data collection and pre-processing: for example, determining whether analytical data have been associated with the correct samples, whether a manual entry was correctly read and typed, or whether data from a collaborator have been falsified. The probability of quality issues such as these can be reduced through robust protocols, in-process QA steps, automation, algorithms to detect systematic anomalies, etc. In the standard data quality models, we might consider prescriptive quality as covering dimensions such as accuracy, integrity, completeness, conformity, consistency and validity.
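Prescriptive checks of this kind can be sketched as simple rule-based validation. The field names and rules below are invented for illustration, not Roche's standards; the point is that vocabulary, range and completeness tests are mechanical and therefore automatable.

```python
# Illustrative prescriptive quality checks: each record is tested
# against pre-determined standards (controlled vocabulary, allowed
# range, completeness). Field names and rules are hypothetical.

ALLOWED_SMOKING_STATUS = {"current", "former", "never"}

def check_record(record):
    """Return a list of quality issues found in one record (empty = pass)."""
    issues = []
    if record.get("smoking_status") not in ALLOWED_SMOKING_STATUS:
        issues.append("smoking_status outside controlled vocabulary")
    age = record.get("age")
    if age is None:
        issues.append("age missing")          # completeness
    elif not (0 <= age <= 120):
        issues.append("age out of range")     # validity
    return issues
```

Note that a record can pass every one of these prescriptive tests and still fail the interpretive perspective discussed next.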
Secondly, what we might call the interpretive quality perspective, relating to the way the data will be interpreted and used for decision making. For example, the smoking status of patients with lung diseases may be recorded simply as: current, former or never. Despite the data meeting the prescribed standards, being accurate, complete and conforming to the model, they may not be of sufficient quality to address the complexities of the biology underlying the diseases, where one might need information describing the number of cigarettes smoked per day and the time since the individual smoked their last cigarette.
Similarly, when working with derived data from an algorithm, one may need to understand the training set and the boundaries of the model to understand how far the derived data can be interpreted for specific input conditions. One can address some of these issues with meta-data describing how data were generated in the lab, in the clinic or in silico.
Q4. What are the criteria you use to select the data to be integrated?
Bryn Roberts: Certainly the quality aspects above play a key role. We have, for example, discarded historical laboratory results when, after careful consideration, we decided that the meta-data (lab protocols, association with target information, etc.) were insufficient for anyone to make meaningful use of them. Data derived from old technologies, despite being valuable at some point in the past, may have been superseded or may not meet today’s requirements, so will have lower priority for integration, although may still be archived for specialist reference. Relevance is another critical factor – we prioritise the integration of data relating to our current molecules or disease targets, and data that we deem to have the most valuable content.
Q5. Is there a risk that data integration introduces unwanted noise or bias? If yes, how do you cope with that?
Bryn Roberts: I’m not too concerned about these aspects when the above architectures and principles are applied. There is clearly a risk of bias when the integrated landscape is incomplete, so understanding what you have, and what you don’t, when searching is important. Storing only aggregated or derived data can be risky, as aggregation can mask properties such as skewness and outliers, and there are obviously benefits in having the ability to access and re-analyse upstream and raw data, as models and algorithms improve or an analyst has a specific use-case. Integration, if performed well, should not introduce additional noise, although noise reduction may potentially mask signals in data when they are aggregated or transformed in other ways.
I often hear people talking about Data Lakes and it certainly seems to be one of the hype terms of the last couple of years. This approach to data ‘integration’ does concern me if not implemented thoughtfully, especially for the complex scientific and clinical data used in R&D. Given that the Big Data stack allows for data to be poured into the metaphorical ‘Lake’ with a very low cost of entry, it is tempting to throw everything in, getting caught up with KPIs such as volume captured, with little thought to the backend use-cases and costs incurred when utilising the data. I also wouldn’t advocate the opposite extreme of RDBMS-only data Warehousing, where the front-end costs and timelines escalate to unreasonable levels and the models struggle to incorporate new data types. There is a pragmatic middle-ground, where up-front work on the majority of data has a positive return on investment, but challenging data are not excluded from the integration. This complementary Warehouse+Lake approach allows for continuous refinement, based on ongoing use of the data, to maximize value over the longer-term.
Q6. What specific data analytics techniques are most effective in pharma R&D?
Bryn Roberts: We have so many data types and use-cases, spanning chemistry, biology, clinic, business, etc., that we apply almost every analytic technique you can think of. Classical statistical methods and visual analytics have broad application, as do modelling and simulation, the latter being used extensively in areas from molecular simulations and computational chemistry to genotype-phenotype associations, pharmacokinetics and epidemiology. We are increasingly using Artificial Intelligence (AI), Machine Learning and Deep Learning, in applications such as image analysis, large scale clinico-genomic analysis, phenotypic profiling and analysis of high dimensional time-series data.
Q7. What are your current main projects in so-called “Precision Medicine”?
Bryn Roberts: In Roche we tend to use the term Personalised Healthcare rather than Precision Medicine, since we have both Pharmaceuticals and Diagnostics divisions. However, the intention is similar in that we want to identify which treatments and other interventions will be effective and safe for which patients, based on profiling, which may include genetics, genomics, proteomics, imaging, etc. We have many initiatives ongoing in research, development and for established products. One example is developing a deeper understanding of how the mutational and immunological status of tumours influences response to targeted therapeutics, immunotherapies and combinations. Forward and reverse translation in such examples is critical, as we design clinical trials and select participants and then, in turn, inform new research initiatives based on data fed back from the clinic. We have made considerable headway in this space thanks to progress in genomic analysis and quantitative digital pathology, supported by collaborations across the Roche Group, including organizations such as Tissue Diagnostics, Foundation Medicine and Flatiron.
Another, quite different, example is our application of mobile and sensor technology to monitor symptoms, disease progression and treatment response – the so-called “Digital Biomarkers”. We have our most advanced programmes in Multiple Sclerosis (MS) and Parkinson’s Disease (PD), with several more in development. Using these tools, a longitudinal real-world profile is built that, in these complex syndromes, helps us to identify signals and changes in symptoms or general living factors, which may have several potential benefits. In clinical trials we hope to generate more sensitive and objective endpoints with high clinical relevance, with the potential to support smaller and shorter studies, and possibly validate targets in earlier studies that might otherwise be overlooked. In the general healthcare setting, tools like these may have great value for patients, physicians and healthcare systems if they are used to inform tailored treatment regimens and enable supportive interventions such as the timing of home visits or provision of walking aids to reduce falls. For those interested in learning more about our work in MS, there is more information available online about our Floodlight Open programme.
Q8. You have been working on trying to detect Parkinson’s disease. What is your experience of using Deep Learning for that purpose?
Bryn Roberts: The data we collect with the digital biomarker apps fall into two classes: 1) active test data, where the subject performs specific tasks on a daily basis, and 2) continuous passive monitoring data, where the subject carries the device (e.g. smartphone) with them as they go about their daily lives and sensors, such as accelerometers and gyroscopes, collect data continuously. These latter data form complex time series, with acceleration and rotation each being measured in three axes, many times per second. From these data, we build a picture of the individual’s daily activities and performance, which is ultimately what we hope to improve for patients with our new therapies. We apply Deep Learning to do this activity-performance classification, or Human Activity Recognition (HAR), using deep artificial neural networks trained on well-annotated datasets. Since the data are time series, the network utilises Long Short-Term Memory (LSTM) layers to provide recurrence, hence the name “Recurrent Neural Network” or RNN. Examples of what we might study here are how well a patient with PD is able to stand up from a chair or climb a staircase.
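A typical preprocessing step for such sensor streams, before an LSTM-based HAR model sees them, is to cut the continuous signal into fixed-length, overlapping windows. The sketch below illustrates this; the window length and overlap are illustrative choices, not the parameters used at Roche.

```python
# Sketch: segment a continuous 6-channel sensor stream (3-axis
# acceleration + 3-axis rotation per timestep) into fixed-length,
# overlapping windows -- the usual input shape for an LSTM-based
# Human Activity Recognition classifier.

def make_windows(samples, window_len=128, step=64):
    """samples: list of per-timestep readings (e.g. 6 floats each).
    Returns a list of windows, each a list of window_len readings."""
    windows = []
    for start in range(0, len(samples) - window_len + 1, step):
        windows.append(samples[start:start + window_len])
    return windows
```

Each window would then be labelled with an activity (e.g. "sit-to-stand", "climbing stairs") in the annotated training data, and the network learns to classify unseen windows.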
The advantages of using digital, mobile and AI technologies in this way, compared to infrequent in-clinic assessments, are that they are highly objective and sensitive, can detect day-to-day symptom fluctuations, are performed in the real-world setting providing increased relevance, place a relatively low burden on patients, and produce data that can be assessed by the patient and/or physician in near real-time, so both become better informed and empowered.
Extending the application beyond clinical trials, disease monitoring and management, these technologies have the potential, in some disorders, to deliver solutions with a direct beneficial effect that can be measured objectively through improved outcomes. Thus, this work is also laying the foundation for advances in digital therapeutics, or “digiceuticals”, where we’ve seen a huge increase in interest, and the first regulatory approvals, over the last year or so.
With great power comes great responsibility, so we work closely with the participants and regulators to ensure that data protection and privacy are upheld to the highest standards and that participants are fully informed and consent.
As we have developed the platform over the past few years, the establishment of robust end-to-end data processes and the building of trust has run side-by-side with the technology innovation.
Q9. What kind of public data did you use for training your models?
Bryn Roberts: The Human Activity Recognition (HAR) model was initially trained using two independent public datasets of everyday activity from normal individuals. The first from Stisen et al. (“Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition”, Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, 2015.) and the second from Weiss et al. (“The impact of personalization on smartphone-based activity recognition”, AAAI Workshop on Activity Context Representation: Techniques and Languages, 2012.). From these data, 90% were used to train the model and 10% for model validation.
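The 90/10 split mentioned here can be sketched generically as a seeded shuffle-and-cut. This is an illustration of the idea, not the exact procedure used in the study:

```python
# Sketch of a reproducible 90/10 train/validation split: shuffle with
# a fixed seed, then cut at 90%. Seed and ratio are illustrative.
import random

def split_90_10(items, seed=42):
    """Return (train, validation) lists covering all items exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.9)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed matters in practice: it lets the same validation set be reused when comparing model variants.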
Q10. What results did you obtain so far?
Bryn Roberts: In passive monitoring of gait and mobility we have published, for example, significant differences between healthy subjects and PD patients in parameters such as sitting-to-standing transitions, turning speed when walking and in overall gait parameters. In the active test panel, we have demonstrated correlation with the current standard rating scale for PD (MDS-UPDRS) in symptom areas such as tremor, dexterity, balance and postural stability. However, in some measurements (e.g. rest tremor) the digital biomarker appears to be more sensitive at detecting low-intensity symptoms than the in-clinic rating, and corresponds better with patients’ self-reported data.
For more information, see, for example: “Evaluation of Smartphone-Based Testing to Generate Exploratory Outcome Measures in a Phase 1 Parkinson’s Disease Clinical Trial”, Lipsmeier et al., Movement Disorders, 2018.
Q11. Are there any other technological advances on the horizon that you are excited about?
Bryn Roberts: There’s a lot of activity in the healthcare and pharma sector at the moment around blockchain. Some of the use-cases have potential interest to us in R&D, such as secure sharing of genomic and medical data.
I don’t think blockchain is a requirement to do this effectively but may be an enabler, especially if it gains broad adoption.
I’m intrigued by the general trend towards empowering individuals to share their data in a secure and controlled environment. Democratisation of data in this way has to be the future. Imagine what we will be able to do in decades to come, when individuals have access to their complete healthcare records in electronic form, paired with high quality data from genomics, epigenetics, microbiome, imaging, activity and lifestyle profiles, etc., supported by a platform that enables individuals to share all or parts of their data with partners of their choice, for purposes they care about, in return for services they value – very exciting!
This vision, and even the large datasets available today, are driving a paradigm shift in data management and compute for us. The need to federate, both data and compute, across multiple locations and organisations is a change from the recent past, when we could internalise all the data of interest into our own data centers. Cloud, Hadoop, containers and other technologies that support federation are maturing quickly and are a great enabler to big data and advanced analytics in R&D.
What I’m particularly excited about just now is the potential of universal quantum computing (QC). Progress made over the last couple of years gives us more confidence that a fault-tolerant universal quantum computer could become a reality, at a useful scale, in the coming years. We’ve begun to invest time, and explore collaborations, in this field. Initially, we want to understand where and how we could apply QC to yield meaningful value in our space. Quantum mechanics and molecular dynamics simulation are obvious targets, however, there are other potential applications in areas such as Machine Learning.
I guess the big impacts for us will follow “quantum inimitability” (to borrow a term from Simon Benjamin from Oxford) in our use-cases, possibly in the 5-15 year timeframe, so this is a rather longer-term endeavour.
Dr Bryn Roberts
Bryn gained his BSc and PhD in pharmacology from the University of Bristol, UK. Following post-doctoral work in neuropharmacology, he joined Organon as Senior Scientist in 1996. A number of roles followed with Zeneca and AstraZeneca, including team and project leader roles in high throughput screening and research informatics. In 2004 he became head of Discovery Informatics at the AstraZeneca sites in Cheshire, UK.
Bryn joined Roche in Basel in 2006, and his role as Global Head of Informatics was expanded in 2014 to Global Head of Operations for Pharma Research and Early Development. He is also the Centre Head for the Roche Innovation Centre Basel.
Beyond Roche, Bryn is a Visiting Fellow at the University of Oxford, where he is a member of the External Advisory Board of the Dept. of Statistics and the Scientific Management Committee for the Systems Approaches to Biomedical Sciences Centre for Doctoral Training. He is a member of the Advisory Board to the Pistoia Alliance. Bryn was recognized in the Fierce Biotech IT list of Top 10 Biotech Techies 2013 and in the Top 50 Big Data Influencers in Precision Medicine by the Big Data Leaders Forum in 2016.
Follow us on Twitter: @odbmsorg
“Bundesdruckerei has transformed itself from a traditional manufacturer of official documents such as passports and ID cards to one of the leading companies for security solutions, also in the digital sector.”–Ilya Komarov
I have interviewed Ilya Komarov, researcher at the German Federal Printing Office (“Bundesdruckerei”). We talked about how they use blockchain and a NoSQL database – CortexDB – for their identity and rights management system, FIDES.
Q1. The “Bundesdruckerei” (Federal Printing Office), a German public company, has been, since 1951, the manufacturer of banknotes, stamps, identity cards, passports, visas, driving licences, and vehicle registration certificates. What do you do now?
Ilya Komarov: Bundesdruckerei has transformed itself from a traditional manufacturer of official documents such as passports and ID cards to one of the leading companies for security solutions, also in the digital sector. For the development of further, safety-relevant products, the innovation department now relies on the CortexDB platform.
Q2. Do you use blockchain technology? If yes, for what?
Ilya Komarov: Although Bundesdruckerei’s ID-Chain technology is based on the data integrity principle of a blockchain, it is adapted to the requirements of powerful and secure identity and rights management.
The difference from a conventional blockchain, however, is the bi-directional linking of the blocks, as well as the generation of many individual chains rather than a single, ever-growing chain. Unlike in a blockchain, the chain links are connected to each other in both directions, i.e. a block knows the next block as well as its predecessor.
This chain structure makes it possible to quickly check the integrity of the blocks and that of their respective neighbours in both directions and in detail, right down to the very last link. Functions from quantum-mechanical analytics rather than hash values are used as a security mechanism. This mechanism begins with the generation of an atomic wave function for each block in the chain. The blocks can then be idealized as atoms and described in quantum-mechanical terms.
In analogy to nature, these atoms can then join up with other atoms (blocks) to form molecules (blockchains).
By applying these principles, two blocks form unique molecular connections that are used as a security mechanism for the blocks and for the chain as a whole. The ID-Chains now offer the security of linked data structures combined with a high level of flexibility and performance.
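The bi-directional linking described above can be illustrated with a small sketch. One deliberate simplification: ordinary SHA-256 hashes stand in for the quantum-mechanical functions Bundesdruckerei describes, so only the chain structure is shown here, not their security mechanism.

```python
# Sketch of a bi-directionally linked chain: each block knows its
# predecessor AND its successor, so integrity can be re-checked
# walking in either direction. SHA-256 replaces the quantum-mechanical
# functions described in the interview (structure only, not security).
import hashlib

def _digest(payload, prev):
    link = prev.digest if prev else ""
    return hashlib.sha256((payload + link).encode()).hexdigest()

class Block:
    def __init__(self, payload, prev=None):
        self.payload = payload
        self.prev = prev                      # backward link
        self.next = None                      # forward link
        self.digest = _digest(payload, prev)
        if prev is not None:
            prev.next = self                  # wire the forward link

def chain_valid(head):
    """Walk forward from head, re-checking every digest and both links."""
    block = head
    while block is not None:
        if block.digest != _digest(block.payload, block.prev):
            return False                      # payload or link tampered
        if block.next is not None and block.next.prev is not block:
            return False                      # forward/backward links disagree
        block = block.next
    return True
```

Because every block is a chain head in waiting, many short, independent chains (one per authorization, as in FIDES) are as easy to maintain as one long one.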
Q3. What is Bundesdruckerei using CortexDB for?
Ilya Komarov: We are running the FIDES development project in cooperation with Cortex AG.
The user-centered identity and rights management system is based on a modified blockchain. It integrates Bundesdruckerei’s security functions into the core of the database. People, machines, processes and objects can be integrated into administration and companies of all industries. Legal requirements, such as the European Data Protection Regulation (GDPR), are implemented technologically.
The FIDES development project aims to develop an identity and rights management system in which the user alone has control over their data. Each authorization is stored in the form of a digital authorization blockchain and is inseparably linked to the identity of the data owner. Each blockchain represents a unique link between an authorization, the owner of the authorization, and a user identity. At any time it is possible to determine who accessed which data with which authorizations, and when and where these authorizations come from.
Bundesdruckerei is using CortexDB as part of its revolutionary identity and rights management system FIDES where the user alone determines what happens with their data. This user-centric identity management system is based on derived blockchain and cognitive database technologies.
Identities and rights are managed in FIDES in the form of digital rights chains, so-called ID-Chains. An individual ID-Chain is created for each right owned by an identity. This means that the system is made up of millions of chains that have to be searched in a split second.
Within the scope of a development partnership, the NoSQL database from Cortex AG has been specially optimized to meet the requirements of FIDES.
Thanks to smart data normalization, this data can be accessed almost instantly, without the need for time-consuming searches. As a supplier of high-security solutions, Bundesdruckerei was involved in the development process and integrated the security functions – for instance, encryption and ID-Chain creation and validation – directly into the core of the database.
Q4. What are the typical problems you encounter in ID Management Systems (IDMS) based on encrypted block chain technology?
Ilya Komarov: Blockchain is opening up a vast range of new possibilities, however, due to its technical limits it is not suitable for every situation. The ID-Chains take the principle of linked blocks and adapt it to the requirements of powerful and secure identity and rights management.
The biggest difference from a conventional blockchain is the generation of many individual ID-Chains rather than one ever-longer chain. Each of these is a separate chain that can be easily saved or discontinued. This means, for instance, that individual chains can be marked as invalid, making it technically possible to implement the right to be forgotten. This is neither possible nor intended with conventional blockchains.
Q5. What are the lessons learned so far?
Ilya Komarov: FIDES is currently being used in proof-of-concept projects by our customers. The scope of application is wide: from small private businesses to large groups and public authorities.
Proof-of-concept trials conducted with our customers show that many of the problems relate to identification and the ownership of data.
As soon as the data owner has full control over the data, many privacy problems will become irrelevant. This is the case, for instance, with patient data in the field of healthcare or personal data in dealings with public authorities.
Access control systems as well as IoT devices also require secure administration of identity and rights.
Thanks to the flexibility of CortexDB and ID-Chains, FIDES has what it takes to solve these problems.
Ilya Komarov has been working at Bundesdruckerei’s research and development departments since 2008. His research subjects include identity management, security systems and big data. In 2017, he started to work on developing new blockchain technologies for the secure management of identities and authorisations.
Mr. Komarov received his degree in Computer Science at Humboldt University in Berlin.
“I think the biggest challenge is that in the rail business we have a very large set of old and country specific regulations that date back many decades. These regulations are meant to protect passengers, but some of them are not anymore fitting to the modern capabilities of technology and instead drive cost and slow innovation down dramatically.” –Gerhard Kress
Artificial intelligence acts as an enabler for many innovations in the rail industry.
In this interview, I have spoken with Gerhard Kress, who heads Data Services globally for the Rail business and is responsible for the Railigent® solution at Siemens. We discussed innovation and the use of AI and data-driven technologies in the transport sector, and specifically how Siemens’ Railigent solution is implemented.
Railigent is cloud-based and designed to help rail operators and rail asset owners improve fleet availability and operations, for example by enabling intelligent data gathering, monitoring and analysis for prescriptive maintenance in the rail transport industry.
This interview was conducted in the context of a new EU-funded project called LeMO (“Leveraging Big Data to Manage Transport Operations”). The LeMO project studies and analyses big data in the European transport domain, with a focus on five transport dimensions: mode, sector, technology, policy and evaluation.
LeMO conducts a series of case studies in order to provide recommendations on the prerequisites of effective big data implementation in the transport field. The LeMO project has selected Siemens’ Railigent as one of its seven main transport case studies in Europe.
Q1. What is your role at Siemens?
Gerhard Kress: At Siemens, I am heading Data Services globally for the Rail business. This means that I am heading all MindSphere Application Centers that focus on rail topics, from the United States to Australia.
Q2. What are in your opinion the main challenges, barriers and limitations that transport researchers, engineers and policy makers today face as they work to build efficient, safe, and sustainable transportation systems?
Gerhard Kress: I think the biggest challenge is that in the rail business we have a very large set of old and country specific regulations that date back many decades. These regulations are meant to protect passengers, but some of them are not anymore fitting to the modern capabilities of technology and instead drive cost and slow innovation down dramatically.
Q3. You manage all the data analytics centers of Siemens for rail transport globally. What are the main challenges you face and how you solve them?
Gerhard Kress: There are a number of key challenges. The first is to develop offerings that are globally relevant for our customers. The rail industry is very different across the continents, and with country-specific legislation there is a very diverse landscape of requirements to address. Another important challenge is to manage the network of data analytics centers in such a way that they leverage local specifics but at the same time learn from each other and act as a true global network.
The way we have addressed these issues is to set up small agile teams in each MindSphere Application Center that work very closely with customers to understand their issues and how they create tangible value. These teams create customer-specific solutions, but use existing reusable analytics elements to build them. In order to make this happen globally, we have created a simple set of tools and processes and have also centralized the product development function across all of the data analytics centers.
Q4. You are responsible for the Railigent Asset Management Solution at Siemens. What is it?
Gerhard Kress: Railigent is our solution to help customers manage their rail assets smarter and get more return from them. Railigent therefore contains a cloud-based platform layer to support the ingest and storage of large and diverse data sets, high-end data analytics and applications. This layer is open, both for customers and partners.
On top of this layer, Railigent provides a large set of applications for monitoring and analyzing rail assets. Also here applications and components can be provided by partners or customers. Target is to help customers improve fleet availability, maintenance and improve operations.
Q5. Who are the customers for Railigent, and what benefits do they have in using Railigent?
Gerhard Kress: Customers for Railigent include rail operators and rail asset owners. The key benefit for them is that they can improve asset and system availability and therefore offer more services with the same fleet size. Railigent also helps these customers reduce lifecycle costs for their assets and improve their operations.
Q6. What are the main technological components of Railigent?
Gerhard Kress: Railigent builds on technologies from MindSphere, extended with rail-specific elements such as data models and semantics, rail-specific format translators, and of course our applications and data analytics models.
The foundation is a data lake in the cloud (AWS) in which we store the data in a loosely coupled format and create the use-case-specific structures on read.
Data is ingested in batch or as a stream, depending on the source, and during ingest we already apply the first analytics models to validate and augment the data.
For every step in the data lifecycle we use active notifications to move the data to the next stage, and as much as possible we rely on platform services from AWS to build the applications.
Our applications consist of microservices, which we bundle in a common UI framework. We have also deployed a full CI/CD pipeline based on Jenkins.
Data analytics happens either in sandboxes, while a model is still in development, or in the full platform.
We mostly use Python and PySpark, but also use other technologies when needed (e.g., deep-learning-driven approaches).
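The loosely coupled storage with structure created only on read can be sketched in plain Python. This is an illustrative toy, not Railigent's actual code; the record fields and fleet names are invented:

```python
import json

# A "data lake" of loosely coupled records: each is stored as generic JSON,
# with no global schema imposed at write time.
lake = [
    json.dumps({"fleet": "ICE-4", "ts": 1, "speed_kmh": 240, "door_faults": 0}),
    json.dumps({"fleet": "ICE-4", "ts": 2, "axle_temp_c": 61.5}),
    json.dumps({"fleet": "Velaro", "ts": 3, "speed_kmh": 300}),
]

def read_with_schema(records, fields):
    """Schema-on-read: project each raw record onto the fields a use case
    needs, filling gaps with None instead of rejecting the record."""
    for raw in records:
        rec = json.loads(raw)
        yield {f: rec.get(f) for f in fields}

# A speed-monitoring use case imposes its own structure only at read time;
# a maintenance use case could read the same lake with different fields.
speed_view = list(read_with_schema(lake, ["fleet", "ts", "speed_kmh"]))
```

The same pattern scales up in PySpark, where a schema is applied when a JSON path is read rather than when data is written.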
Q7. MindSphere is Siemens' cloud-based, open IoT operating system for the Industrial Internet of Things. What specific functionalities of MindSphere did you use when implementing Railigent, and why?
Gerhard Kress: MindSphere and Railigent share a lot of core functions, especially in how data connectivity and data handling are implemented and how IT security of the system is ensured. The key reason to use the same technology is that a secure and reliable platform is essential for our customers, while the key differentiator we provide is generating insight. The pure platform functionalities are not differentiating, so there is no rationale for developing them all over again.
Q8. What other technologies did you use for implementing Railigent?
Gerhard Kress: The key elements of Railigent are not its platform components, but the reusable analytics elements as well as the rail specific applications.
For the analytics side, Railigent uses all kinds of analytics libraries, but also mathematical approaches newly developed by Siemens. Especially in the industrial data area, new mathematical approaches are often required, and such approaches are then integrated into Railigent.
Q9. The foundation of Railigent is a data lake in the cloud (AWS) in which you store the data in a loosely coupled format and create the use-case-specific structures on read. Can you elaborate on how you handle batch and/or streamed data?
Gerhard Kress: Railigent has to handle a large number of data formats, like diagnostic messages, sensor data, work orders, spare part movements, images, etc.
We receive data in all sorts of legacy formats, most of them batch formats. We decrypt these files and annotate them with specific information, both so we can quickly find the data again and to ensure it can be attributed to the right fleet and the right customer. Then we create a generic JSON file, which we store in our data lake.
For streamed data we mostly use MQTT as the transfer protocol and then create the same JSON file format to persist the data in our data lake.
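A common JSON envelope for batch and streamed records might look like the sketch below. The annotation fields here are invented for illustration and are not Siemens' actual format:

```python
import json
import time

def to_generic_json(payload: dict, fleet: str, customer: str, source: str) -> str:
    """Wrap an incoming record (a batch file row or an MQTT message) in the
    same generic JSON envelope before persisting it in the data lake."""
    envelope = {
        "fleet": fleet,          # attribution: which fleet produced the data
        "customer": customer,    # attribution: which customer owns it
        "source": source,        # "batch" or "stream"
        "ingested_at": int(time.time()),
        "payload": payload,      # the original record, untouched
    }
    return json.dumps(envelope)

# A streamed sensor reading and a batch work order end up in the same shape.
doc = to_generic_json({"sensor": "axle_temp", "value": 61.5},
                      fleet="ICE-4", customer="DB", source="stream")
```

Because both paths converge on one envelope, downstream analytics never need to care whether a record arrived in batch or over MQTT.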
Q10. What data analytics do you perform?
Gerhard Kress: Most of the data analytics in Railigent is based on machine learning or deep learning. This can be classifiers to identify components which are already showing distress, or prediction algorithms to estimate the remaining useful life of a component. Most of the machine learning is supervised, but there are also cases where unsupervised learning techniques are used.
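A minimal sketch of the second kind of model, extrapolating a degradation trend to estimate remaining useful life, might look like this. It assumes a linear health index, which real components rarely follow, and the readings and threshold are invented:

```python
# Hypothetical degradation readings for one component: (day, health index).
history = [(0, 1.00), (10, 0.92), (20, 0.85), (30, 0.77)]
FAILURE_THRESHOLD = 0.30  # assumed: component is replaced below this health

def remaining_useful_life(history, threshold):
    """Fit a straight line to the health trend by ordinary least squares and
    extrapolate to the day the failure threshold is crossed."""
    n = len(history)
    sx = sum(t for t, _ in history)
    sy = sum(h for _, h in history)
    sxx = sum(t * t for t, _ in history)
    sxy = sum(t * h for t, h in history)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    crossing_day = (threshold - intercept) / slope
    return crossing_day - history[-1][0]  # days left after the last reading

rul = remaining_useful_life(history, FAILURE_THRESHOLD)
```

Production models would use richer features and supervised learning on labeled failure histories; the point here is only the shape of the problem.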
Q11. Is there a difference between performing analytics while a model is still in development and in the full platform?
Gerhard Kress: We usually develop models in a type of sandbox environment so that we can quickly iterate on the model with real data, validate the results, and improve the model further. Once a certain quality is reached, we transfer the model into the operational environment of Railigent. This requires us to be much more formal in the deployment so that results are correct and performance is predictable. And, of course, the model then needs to be integrated into the production data pipeline in order to be available 24/7.
Q12. What are the lessons learned so far in using Railigent?
Gerhard Kress: So far we have learnt quite a few lessons from Railigent deployments, and most of them deal with value generation for our customers.
We have learned that we needed to be closer to our customers when creating applications. For this we have set up agile “Accelerator” teams, which develop the first insights together with the customer in the first week and make them accessible through a first web application. These teams are often co-located with the customers so that we can jointly create the right solution for the customer's problem.
In our customer activities, we have learned to treat customer value as the main driver. We now try to quickly deliver a first application, which we then improve, but we also focus on making the insights actionable so that the customer can immediately start implementing them and realising the promised value.
With regard to handling data, we have learned that in a complex big data world with many different types of data elements, we have to resort to a schema-on-read approach, as an integrated, overarching logical data model would not be feasible.
We have already implemented these learnings, and we can see the value the changes have helped create for our customers.
Q13. What is the roadmap ahead for Railigent?
Gerhard Kress: Railigent Version 2.0 is just about to be released in July, and we are aiming for Version 3 in December. On the roadmap we have not only customer-facing application features for rolling stock and signaling, but also technical building blocks, analytics components, and platform topics. Our focus in V3 is on features to better integrate partners, capabilities that give partners and customers easier access to analytics elements inside Railigent, and handling of real-time data. Additionally, we will improve operations and deploy a new type of highly scalable, overarching analytical capability that can be used by any application inside Railigent.
Our target is to become even more relevant for our customers and provide tangible value.
Gerhard Kreß is responsible for data services at Siemens Mobility GmbH, where he aims to build new customer offerings enabled by data analytics for both rail vehicles and rail infrastructure.
Before that, he was responsible in Siemens Corporate Technology for implementing the corporate big data initiative “Smart Data to Business”, and he worked for three years in Siemens Corporate Strategy in the corporate program to refine the IT strategy for the Siemens businesses. There he was also responsible for setting up the Siemens big data initiative.
Prior to his work in Corporate Strategy, he spent eight years in Siemens IT Solutions and Services (SIS), managing systems and technologies for the global service desks and in the project management of major IT outsourcing projects.
Gerhard Kreß started his professional career at McKinsey & Company, where he focused on growth initiatives and high-tech industries.
He holds a German diploma in Theoretical Physics and a Master of Arts in International Relations and European Studies.
During his studies, Gerhard Kreß worked for the student NGO “AEGEE-Europe” where he was President and Member of the European board of the organisation.
– Railigent® – the application suite to manage your assets smarter (Link to YouTube Video), May 13, 2018
– UNDERSTANDING AND MAPPING BIG DATA In Transport Sector, LeMo Project Deliverable D1.1, May 13, 2018 (Link to .PDF 78 pages)
– BIG DATA POLICIES In Transportation, LeMo Project Deliverable D1.2, May 31, 2018 (Link to .PDF 60 pages)
– BIG DATA METHODOLOGIES, TOOLS AND INFRASTRUCTURES in Transportation, LeMo Project Deliverable D1.3, July 16, 2018 (Link to .PDF 50 pages)
– LeMO Project Web site (LINK). The LeMO project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 770038.
– Generating Transport Data, by Filipe Teixeira, ODBMS.org, May 16, 2018
– On Smart Cities and Mobility. Q&A with Praveen Subramani, ODBMS.org, May 28, 2018
– On Data and Transportation. Q&A with Carlo Ratti, ODBMS.org, Apr. 11, 2018
– On Logistics and 3D printing. Q&A with Alan P. Amling, Vice President, UPS Corporate Strategy, ODBMS.org, Apr. 2018
Follow us on Twitter: @odbmsorg
“Debugging AI systems is harder than debugging traditional ones, but not impossible. Mainly it requires a different mindset, that allows for nondeterminism and a partial understanding of what’s going on. Is the problem in the data, the system, or in how the system is being applied to the data? Debugging an AI is more like domesticating an animal than debugging a program.”– Pedro Domingos.
I have interviewed Pedro Domingos, professor of computer science at the University of Washington and the author of “The Master Algorithm“, a bestselling introduction to machine learning for non-specialists. We talked about various topics related to Artificial Intelligence, Machine Learning, and Deep Learning.
Q1. What’s the difference between Artificial Intelligence, Machine Learning, and Deep Learning?
Pedro Domingos: The goal of AI is to get computers to do things that in the past have required human intelligence: commonsense reasoning, problem-solving, planning, decision-making, vision, speech and language understanding, and so on. Machine learning is the subfield of AI that deals with a particularly important ability: learning. Just as in humans the ability to learn underpins all else, so machine learning is behind the growing successes of AI.
Deep learning is a specific type of machine learning loosely based on emulating the brain. Technically, it refers to learning neural networks with many hidden layers, but these days it’s used to refer to all neural networks.
Q2. Several AI scientists around the world would like computers to learn as much about the world, and as rapidly and flexibly, as humans do (or even more). How can results learned by machines be physically plausible, or be made understandable by us?
Pedro Domingos: The results can be in the form of “if . . . then” rules, decision trees, or other representations that are easy for humans to understand. Some types of models can be visualized. Neural networks are opaque, but other types of model don’t have to be.
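For instance, a decision tree is nothing more than nested "if ... then" rules, which is why such models stay readable to humans. The credit-scoring rules below are invented purely for illustration:

```python
def credit_decision(income, debt_ratio, years_employed):
    """A hand-written decision tree: every path through the nested ifs is a
    human-readable rule, so anyone can trace exactly why a decision was made."""
    if income < 30_000:
        return "deny: income below 30k"
    if debt_ratio > 0.5:
        if years_employed < 2:
            return "deny: high debt and short employment"
        return "review: high debt but stable employment"
    return "approve"

# A learned tree has the same structure; tools exist to print trained trees
# in exactly this if/then form for inspection.
decision = credit_decision(80_000, 0.6, 5)
```

A tree learned from data would have the same shape, only with thresholds chosen by the learning algorithm rather than by hand.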
Q3. It seems no one really knows how the most advanced AI algorithms do what they do. Why?
Pedro Domingos: Since the algorithms learn from data, it’s not as easy to understand what they do as it would be if they were programmed by us, like traditional algorithms. But that’s the essence of machine learning: it can go beyond our knowledge to discover new things. A phenomenon may be more complex than a human can understand, but not more complex than a computer can understand. And in many cases we also don’t know what humans do: for example, we know how to drive a car, but we don’t know how to program a car to drive itself. With machine learning, however, the car can learn to drive by watching videos of humans driving.
Q4. That could be a problem. Do you agree?
Pedro Domingos: It’s a disadvantage, but how much of a problem it is depends on the application. If an AI algorithm that predicts the stock market consistently makes money, the fact that it can’t explain how it did it is something investors can live with. But in areas where decisions must be justified, some learning algorithms can’t be used, or at least their results have to be post-processed to give explanations (and there’s lots of research on this).
Q5. Let’s consider an autonomous car that relies entirely on an algorithm that taught itself to drive by watching a human do it. What if one day the car crashed into a tree, or even worse, killed a pedestrian?
Pedro Domingos: If the learning took place before the car was delivered to the customer, the car’s manufacturer would be liable, just as with any other machinery. The more interesting problem is if the car learned from its driver. Did the driver set a bad example, or did the car not learn properly?
Q6. Would it be possible to create some sort of “AI-debugger” that let you see what the code does while making a decision?
Pedro Domingos: Yes, and many researchers are hard at work on this problem. Debugging AI systems is harder than debugging traditional ones, but not impossible. Mainly it requires a different mindset, that allows for nondeterminism and a partial understanding of what’s going on. Is the problem in the data, the system, or in how the system is being applied to the data? Debugging an AI is more like domesticating an animal than debugging a program.
Q7. How can computers learn together with us still in the loop?
Pedro Domingos: In so-called online learning, the system is continually learning and performing, like humans. And in mixed-initiative learning, the human may deliberately teach something to the computer, the computer may ask the human a question, and so on. These types of learning are not widespread in industry yet, but they exist in the lab, and they’re coming.
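A minimal sketch of the online setting is a perceptron that keeps predicting while it keeps updating, one example at a time. This is a textbook toy on invented data, not a production system:

```python
class OnlinePerceptron:
    """Learns continually from a stream of examples instead of a fixed
    training set -- the 'online learning' setting described above."""
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        s = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if s >= 0 else 0

    def learn_one(self, x, y):
        """Predict, then update from the error: the model keeps performing
        while it keeps learning."""
        err = y - self.predict(x)
        if err:
            self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * err

model = OnlinePerceptron(2)
# A stream where the label simply follows the first feature.
stream = [([1, 0], 1), ([0, 1], 0), ([1, 1], 1), ([0, 0], 0)] * 20
for x, y in stream:
    model.learn_one(x, y)
```

Mixed-initiative learning adds a second channel on top of this loop: the human can inject labeled examples deliberately, and the model can ask for labels it is unsure about.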
Q8. Professional codes of ethics do little to change people’s behaviour. How is it possible to define incentives for using an ethical approach to software development, especially in the area of AI?
Pedro Domingos: I think ethical software development for AI is not fundamentally different from ethical software development in general. The interesting new question is: when AIs learn by themselves, how do we keep them from going astray? Fixed rules of ethics, like Asimov’s three laws of robotics, are too rigid and fail easily. (That’s what his robot stories were about.) But if we just let machines learn ethics by observing and emulating us, they will learn to do lots of unethical things. So maybe AI will force us to confront what we really mean by ethics before we can decide how we want AIs to be ethical.
Q9. Who will control in the future the Algorithms and Big Data that drive AI?
Pedro Domingos: It should be all of us. Right now it is mainly the companies that have lots of data and sophisticated machine learning systems, but all of us – as citizens and professionals and in our personal lives – should become aware of what AI is and what we can do with it. That’s why I wrote “The Master Algorithm”: so everyone can understand machine learning well enough to make the best use of it. How can I use AI to do my job better, to find the things I need, to build a better society? Just as driving a car does not require knowing how the engine works, but does require knowing how to use the steering wheel and pedals, everyone needs to know how to control an AI system, and to have AIs that work for them and not for others, just as they have cars and TVs that work for them.
Q10. What are your current research projects?
Pedro Domingos: Today’s machine learning algorithms are still very limited compared to humans. In particular, they’re not able to generalize very far from the data.
A robot can learn to pick up a bottle in a hundred trials, but if it then needs to pick up a cup it has to start again from scratch. In contrast, a three-year-old can effortlessly pick anything up.
So I’m working on a new machine learning paradigm, called symmetry-based learning, where the machine learns individual transformations from data that preserve the essential properties of an object, and can then compose the transformations in many different ways to generalize very far from the data. For example, if I rotate a cup it’s still the same cup, and if I replace a word by a synonym in a sentence the meaning of the sentence is unchanged. By composing transformations like this I can arrive at a picture or a sentence that looks nothing like the original, but still means the same.
It’s called symmetry-based learning because the theoretical framework to do this comes from symmetry group theory, an area of mathematics that is also the foundation of modern physics.
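The group property, that composing symmetry transformations yields another symmetry, can be illustrated with 2-D rotations. This is a toy sketch of the mathematical idea, not the learning algorithm itself:

```python
import math

def rotate(angle):
    """Return a transformation rotating a 2-D point about the origin.
    Rotations preserve what the object 'is': a rotated cup is the same cup."""
    def t(p):
        x, y = p
        return (x * math.cos(angle) - y * math.sin(angle),
                x * math.sin(angle) + y * math.cos(angle))
    return t

def compose(*ts):
    """Compose transformations right-to-left. Composing symmetries yields
    another symmetry -- the group property symmetry-based learning exploits
    to generalize far from the training data."""
    def t(p):
        for f in reversed(ts):
            p = f(p)
        return p
    return t

# Two quarter turns compose to a half turn: (1, 0) maps to (-1, 0),
# and any label attached to the point is unchanged along the way.
half_turn = compose(rotate(math.pi / 2), rotate(math.pi / 2))
point = half_turn((1.0, 0.0))
```

The learning problem is then to discover such meaning-preserving transformations from data, rather than writing them down by hand as here.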
Pedro Domingos is a professor of computer science at the University of Washington and the author of “The Master Algorithm”, a bestselling introduction to machine learning for non-specialists.
He is a winner of the SIGKDD Innovation Award, the highest honour in data science, and a Fellow of the Association for the Advancement of Artificial Intelligence. He has received a Fulbright Scholarship, a Sloan Fellowship, the National Science Foundation’s CAREER Award, and numerous best paper awards.
He received his Ph.D. from the University of California at Irvine and is the author or co-author of over 200 technical publications. He has held visiting positions at Stanford, Carnegie Mellon, and MIT. He co-founded the International Machine Learning Society in 2001. His research spans a wide variety of topics in machine learning, artificial intelligence, and data science, including scaling learning algorithms to big data, maximizing word of mouth in social networks, unifying logic and probability, and deep learning.
The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. New York: Basic Books, 2015.
What’s Missing in AI: The Interface Layer. In P. Cohen (ed.), Artificial Intelligence: The First Hundred Years. Menlo Park, CA: AAAI Press. To appear.
How Not to Regulate the Data Economy. Medium, 2018.
Ten Myths About Machine Learning. Medium, 2016.
Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world, by John Roach, Microsoft AI Blog.
– Alchemy: Statistical relational AI.
– SPN: Sum-product networks for tractable deep learning.
– RDIS: Recursive decomposition for nonconvex optimization.
– BVD: Bias-variance decomposition for zero-one loss.
– NBE: Bayesian learner with very fast inference.
– RISE: Unified rule- and instance-based learner.
– VFML: Toolkit for mining massive data sources.
– Online machine learning class, Pedro Domingos (Link to series of YouTube videos)
– On Technology Innovation, AI and IoT. Interview with Philippe Kahn ODBMS Industry Watch, January 27, 2018
– On Artificial Intelligence and Analytics. Interview with Narendra Mulani ODBMS Industry Watch, August 12, 2017
– How Algorithms can untangle Human Questions. Interview with Brian Christian. ODBMS Industry Watch, March 31, 2017
–Big Data and The Great A.I. Awakening. Interview with Steve Lohr. ODBMS Industry Watch, December 19, 2016
–Machines of Loving Grace. Interview with John Markoff. ODBMS Industry Watch, August 11, 2016
–On Artificial Intelligence and Society. Interview with Oren Etzioni. ODBMS Industry Watch, January 15, 2016
Follow us on Twitter: @odbmsorg
” An AI powered assistant can give you much better advice the more it knows about you and if it can collect data without burdening you. While this challenge creates the obvious but surmountable privacy issues, there is an interesting data integration challenge here to collect data from the digital breadcrumbs we leave all over, such as posts on social media, photos, data from wearables. Reconciling all these data sets into a meaningful and useful signal is a fascinating research problem!”–Alon Halevy
I have interviewed Alon Halevy, CEO of Megagon Labs. We talked about happiness, AI-powered journaling and the HappyDB database.
Q1. What is HappyDB?
Alon Halevy: HappyDB is a crowd-sourced text database of 100,000 answers to the following question: what made you happy in the last 24 hours (or 3 months)? Half of the respondents were asked about the last 24 hours and the other half about the last 3 months.
We collected HappyDB as part of our research agenda on technology for wellbeing. At a basic level, we’re asking whether it is possible to develop technology to make people happier. As part of that line of work, we are developing an AI-powered journaling application in which the user writes down the important experiences of their day. The goal is for the smart journal to understand over time what makes you happy and give you advice on what to do. To that end, however, we need to develop Natural Language Processing technology that can better understand the descriptions of these moments (e.g., what activity the person did, with whom, and in what context). HappyDB was collected to create a corpus of text that will fuel such NLP research by our lab and by others.
Q2. The science of happiness is an area of positive psychology concerned with understanding what behaviors make people happy in a sustainable fashion. How is it possible to advance the state of the art of understanding the causes of happiness by simply looking at text messages?
Alon Halevy: One of the main observations of the science of happiness is that a significant part of people’s wellbeing is determined by the actions they choose to do on a daily basis (e.g., encourage social interactions, volunteer, meditate, etc). However, we are often not very good at making choices that maximize our sustained happiness because we’re focused on other activities that we think will make us happier (e.g., make more money, write another paper).
Because of that, we believe that a journaling application can give advice based on personal experiences that the user has had. The user of our application should be able to use text, voice or even photos to express their experiences.
The text in HappyDB is meant to facilitate the research required to understand texts given by users.
Q3. What are the main findings you have found so far?
Alon Halevy: The happy moments we see in HappyDB are not surprising in nature — they describe experiences that are known to make people happy, such as social events with family and friends, achievements at work, and enjoying nature and mindfulness. However, given that these experiences are expressed in so many different ways in text, the NLP challenge of understanding the important aspects of these moments is quite significant.
Q4. The happy moments are crowd-sourced via Amazon’s Mechanical Turk. Why?
Alon Halevy: That was the only way we could think of to collect such a large corpus. I should note that we only got 2-3 replies from each worker, so this is not a longitudinal study of how people’s happiness changes over time.
The goal is just to collect text describing happy moments.
Q5. You mentioned that HappyDB is a collection of happy moments described by individuals experiencing those moments. How do you verify if these statements reflect the true state of mind of people?
Alon Halevy: You can’t verify such a corpus in any formal sense, but when you read the moments you see they are completely natural. We even have a moment from one person who was happy for getting tenure!
Q6. What is a reflection period?
Alon Halevy: A reflection period is how far back you look for the happy moment. For example, moments that cover a reflection period of 24 hours tend to mention a social event or meal, while moments based on a reflection of 3 months tend to mention a bigger event in life such as the birth of a child, promotion, or graduation.
Q7. The HappyDB corpus, like any other human-generated data, has errors and requires cleaning. How do you handle this?
Alon Halevy: We did a little bit of spell correcting and removed some moments that were obviously bogus (too long, too short). But the hope is that the sheer size of the database is its main virtue and the errors will be minor in the aggregate.
Q8. What are the main NLP problems that can be studied with the help of this corpus?
Alon Halevy: There are quite a few NLP problems. The most basic is to figure out what activity made the person happy (and to distinguish the words describing the activity from all the extraneous text). Who were the people involved in the experience? Was there anything in the context that was critical (e.g., a sunset)? We can ask more reflective questions, such as: was the person happy because of a mismatch between their expectations and reality? Do men and women express happy experiences in different ways? Finally, can we create an ontology of activities that would cover the vast majority of happy moments and reliably map text to one or more of these categories?
Q9. What analysis techniques did you use to analyse HappyDB? Were you happy with existing NLP techniques, or is there a need for deeper NLP techniques?
Alon Halevy: We clearly need new NLP techniques to analyze this corpus and ones like it. In addition to standard somewhat shallow NLP techniques, we are focusing on trying to define frame structures that capture the essence of happy moments and to develop semantic role labeling techniques that map from text to these frame structures and their slots.
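To make the frame idea concrete, here is a deliberately shallow, regex-based sketch of slot filling on a happy moment. Real semantic role labeling is far more capable; the slot names and patterns are invented for illustration:

```python
import re

# Illustrative frame slots: ACTIVITY (a verb phrase) and COMPANION.
# These toy patterns only handle a few first-person constructions.
COMPANION = re.compile(r"\bwith my (\w+)")
ACTIVITY = re.compile(r"\bI (went \w+|played \w+|watched \w+|ate \w+|had \w+)")

def extract_frame(moment: str) -> dict:
    """Map a happy-moment sentence to a (very partial) frame structure."""
    act = ACTIVITY.search(moment)
    comp = COMPANION.search(moment)
    return {
        "activity": act.group(1) if act else None,
        "companion": comp.group(1) if comp else None,
    }

frame = extract_frame("Yesterday I went hiking with my brother and saw a sunset.")
```

The gap between these brittle patterns and the variety of phrasings in 100,000 moments is precisely why the corpus calls for learned semantic role labeling rather than rules.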
Q10. Is HappyDB open to the public?
Qx Anything else you wish to add?
Alon Halevy: Yes, I think developing technology for wellbeing raises some interesting challenges for data management in general. An AI powered assistant can give you much better advice the more it knows about you and if it can collect data without burdening you. While this challenge creates the obvious but surmountable privacy issues, there is an interesting data integration challenge here to collect data from the digital breadcrumbs we leave all over, such as posts on social media, photos, data from wearables. Reconciling all these data sets into a meaningful and useful signal is a fascinating research problem!
Dr. Alon Halevy is a computer scientist, entrepreneur and educator. He received his Ph.D. in Computer Science at Stanford University in 1993. He became a professor of Computer Science at the University of Washington and founded the Database Research Group at the university.
He founded Nimble Technology Inc., a company providing an enterprise information integration platform, and Transformic Inc., a company providing access to deep web content. Upon the acquisition of Transformic by Google Inc., he became responsible for research on structured data as a senior staff research scientist at Google’s head office, where he was engaged in research and development such as Google Fusion Tables. He has served as CEO of Megagon Labs since 2016.
Dr. Halevy is a Fellow of the Association of Computing Machinery (ACM Fellow) and received the VLDB 10-year best paper award in 2006.
–Paper: HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments, Akari Asai, Sara Evensen, Behzad Golshan, Alon Halevy, Vivian Li, Andrei Lopatenko, Daniela Stepanov, Yoshihiko Suhara, Wang-Chiew Tan, Yinzhan Xu
–Software: BigGorilla is an open-source data integration and data preparation ecosystem (powered by Python) to enable data scientists to perform integration and analysis of data. BigGorilla consolidates and documents the different steps that are typically taken by data scientists to bring data from different sources into a single database to perform data analysis. For each of these steps, we document existing technologies and also point to desired technologies that could be developed.
The different components of BigGorilla are freely available for download and use. Data scientists are encouraged to contribute code, datasets, or examples to BigGorilla. We hope to promote education and training for aspiring data scientists with the development, documentation, and tools provided through BigGorilla.
–Software: Jo. Our work is inspired by psychology research, especially a field known as Positive Psychology. We are developing “Jo” – an agent that helps you record your daily activities, generalizes from them, and helps you create plans that increase your happiness. Naturally, this is no easy feat. Jo raises many exciting technical challenges for NLP, chatbot construction, and interface design: how can we build an interface that’s useful but not intrusive? Read more about Jo!
– Data Integration: From Enterprise Into Your Kitchen, Alon Halevy – SIGMOD/PODS Conference 2017
Follow us on Twitter: @odbmsorg
“I would argue that the definition of “small” keeps getting bigger as hardware improves and more economical storage options abound. As data volumes get bigger and bigger, organizations are looking to graduate out of the “small” arena and start to leverage big data for truly transformational projects. “–Ben Vandiver
I have interviewed Ben Vandiver, CTO at Vertica. Main topics of the interview are: Vertica database, the Cloud, and the new Vertica cloud architecture: Eon Mode.
Q1. Can you start by giving us some background on your role and history at Vertica?
Ben Vandiver: My bio covers a bit of this, but I’ve been at Vertica from version 2.0 to our newly released 9.1. Along the way I’ve seen Vertica transform from a database that could barely run SQL and delete records into an enterprise-grade analytics platform. As a developer, I built a number of the core features of the database. Some of my side projects turned into interesting features: Flex Tables, Vertica’s schema-on-read mechanism, and Key/Value, which allows fast, scalable single-node queries. I started the Eon Mode project two and a half years ago to enable Vertica to take advantage of variable workloads and shared storage, both on-premises and in the cloud. Since my promotion to CTO, I remain engaged with development as a core architect, but I also look after product strategy, information flow within the Vertica organization, and technical customer engagement.
Q2. Is the assumption that “One size does not fit all” (aka Michael Stonebraker) still valid for new generation of databases?
Ben Vandiver: Mike’s statement of “One size does not fit all” still holds and if anything, the proliferation of new tools demonstrates how relevant that statement still is today. Each tool is designed for a specific purpose and an effective data analytics stack combines a collection of best-in-class tools to address an organization’s data needs.
For “small” problems, a single flexible tool can often address these needs. But what exactly is “small” in today’s world?
I would argue that the definition of “small” keeps getting bigger as hardware improves and more economical storage options abound. As data volumes get bigger and bigger, organizations are looking to graduate out of the “small” arena and start to leverage big data for truly transformational projects. These organizations would benefit from developing a data stack that incorporates the right tools – BI, ETL, data warehousing, etc. – for the right jobs, and choosing solutions that favour a more open, ecosystem-friendly architecture.
This belief is evident in Vertica’s own product strategy, where our focus is to build the most performant analytical database on the market, free from underlying infrastructure and open to a wide range of ecosystem integrations.
Q3. Vertica, like many databases, started off on-premises and has moved to the cloud. What has that journey looked like?
Ben Vandiver: Our pure software, hardware agnostic approach has enabled Vertica to be deployed in a wide variety of configurations, from embedded devices to multiple cloud platforms. Historically, most of Vertica’s deployments have been on-premises, but we’ve been building AMIs for running Vertica in the Amazon cloud since 2008. More recently, we have built integrations for S3 read/write and cloud monitoring.
In our 9.0 release last year, we extended our SQL-on-Hadoop offering to support Amazon S3 data in ORC or Parquet format, enabling customers to run highly-performant analytical queries against their Hadoop data lakes on S3.
And of course, with our latest 9.1 release, the general availability of Eon Mode represents a transformational leap in our cloud journey.
With Eon Mode, Vertica is moving from simply integrating with cloud services to introducing a core architecture optimized specifically for the cloud, so customers can capitalize on the economics of compute and storage separation.
Q4. Vertica just released a completely new cloud architecture, Eon Mode. Can you describe what that is and how it works?
Ben Vandiver: Eon Mode is a new architecture that places the data on a reliable, cost-effective shared storage, while matching Vertica Enterprise Mode’s performance on existing workloads and supporting entirely new use cases. While the design reuses Vertica’s core optimizer and execution engine, the metadata, storage, and fault tolerance mechanisms are re-architected to enable and take advantage of shared storage. A sharding mechanism distributes load over the nodes while retaining the capability of running node-local table joins.
A caching layer provides full Vertica performance on in-cache data and transparent query on non-cached data with mildly degraded performance.
Eon Mode initially supports running on Amazon EC2 compute and S3 storage, but includes an internal API layer that we have built to support our roadmap vision for other shared storage platforms such as Microsoft Azure, Google Cloud, or HDFS.
Eon Mode demonstrates strong performance, superior scalability, and robust operational behavior.
With these improvements, Vertica delivers on the promise of cloud economics, by allowing customers to provision only the compute and storage resources needed – from month to month, day to day, or hour to hour – while supporting efficient elasticity. For organizations that have more dynamic workloads, this separation of compute and storage architecture represents a significant opportunity for cloud savings and operational efficiency.
Q5. What are the similarities and differences between Vertica Enterprise Mode and Vertica Eon Mode?
Ben Vandiver: Eon Mode and Enterprise Mode share significant similarities, but also differ in important ways.
Both are accessible from the same RPM – the choice of mode is determined at the time of database deployment. Both use the same cost-based distributed optimizer and data flow execution engine. The same SQL functions that run on Enterprise Mode will also run on Eon Mode, along with Vertica’s extensions for geospatial, in-database machine learning, schema-on-read, user-defined functions, time series analytics, and so on.
The fundamental difference, however, is that Enterprise Mode deployments must provision storage capacity for the entire dataset, whereas Eon Mode deployments need only provision cache sized for the working set. Additionally, Eon Mode has a lightweight re-subscribe and cache-warming step which speeds recovery for down nodes. Eon Mode can rapidly scale out elastically for performance improvements, which is the key to aligning resources to variable workloads and optimizing for cloud economics.
In contrast, many analytics platforms offered by cloud providers are not incentivized to optimize infrastructure costs.
Q6. How does Vertica distribute query processing across the cluster in Eon Mode and implement load balancing?
Ben Vandiver: Eon Mode combines a core Vertica concept, Projections, with a new sharding mechanism to distribute processing load across the cluster.
A Projection describes the physical storage for a table, stipulating columns, compression, sorting, and a set of columns to hash to determine how the data is laid out on the cluster. Eon introduces another layer of indirection, where nodes subscribe to and serve data for a collection of shards. During query processing, Vertica assembles a node to serve each shard, selecting from available subscribers. For an elastically scaled out cluster, each query will run on just some of the nodes of the cluster. The administrator can designate sub-clusters of nodes for workload isolation: clients connected to a sub-cluster run queries only on nodes in the sub-cluster.
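The shard-subscription idea described above can be illustrated with a small Python sketch. Everything here is hypothetical — the node names, shard count, and least-loaded selection policy are invented for illustration and are not Vertica’s actual mechanism:

```python
import hashlib

NUM_SHARDS = 4

def shard_for_key(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash a segmentation-key value to a shard (illustrative hashing only)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# node -> shards it subscribes to (two subscribers per shard for fault tolerance)
subscriptions = {
    "node1": {0, 1},
    "node2": {1, 2},
    "node3": {2, 3},
    "node4": {3, 0},
}

def plan_participants(subscriptions: dict) -> dict:
    """Pick one serving node per shard, preferring the least-loaded candidate."""
    load = {node: 0 for node in subscriptions}
    plan = {}
    for shard in range(NUM_SHARDS):
        candidates = [n for n, shards in subscriptions.items() if shard in shards]
        chosen = min(candidates, key=lambda n: load[n])
        plan[shard] = chosen
        load[chosen] += 1
    return plan

plan = plan_participants(subscriptions)
# Every shard is served by exactly one subscriber, so a query touches only a
# subset of the cluster; restricting candidates to a sub-cluster gives the
# workload isolation described above.
```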
Q7. What do you see as the primary benefits of separating compute and storage?
Ben Vandiver: Since storage capacity is decoupled from compute instances, an Eon Mode cluster can cost-effectively store far more data than an Enterprise Mode deployment. The resource costs associated with maintaining large amounts of historical data are minimized with Eon Mode, removing the incentive to split current and historical queries across two different tools (such as a data lake and a query engine).
The operational cost is also minimized since node failures are less impactful and easier to recover from.
On the flip side, running many compute instances against a small shared data set provides strong scale-out performance for interactive workloads. Elasticity allows movement between the two extremes to align resource consumption with dynamic needs. And finally, the operational simplicity of Eon Mode can be impactful to the health and sanity of the database administrators.
Q8. What types of engineering challenges had to be overcome to create and launch this new architecture?
Ben Vandiver: Eon Mode is an application of core database concepts to a cloud environment. Even though much of the core optimizer and execution engine functionality remains untouched, large portions of the operational core of the database are different in Eon Mode. While Vertica’s storage usage maps well to an object store like S3, determining when a file can be safely deleted was an interesting challenge. We also migrated a significant amount of our test infrastructure to AWS.
Finally, Vertica is a mature database, having been around for over 10 years – Eon Mode doesn’t have the luxury of launching as a 0.1 release full of bugs. This is why Eon Mode has been in Beta, both private and public, for the last year.
Q9. It’s still early days for Eon Mode’s general availability, but do you have any initial customer feedback or performance benchmarks?
Ben Vandiver: Although Eon Mode just became generally available, it’s been in Beta for the last year and a number of our Beta customers have had significant success with this new architecture. For instance, one large gaming customer of ours subjected a much smaller Eon Mode deployment to their full production load, and realized 30% faster load rates without any tuning. Some of their queries ran 3-6x faster, even when spilling out of the cache. Operationally, the company’s node recovery was 6-8x faster and new nodes could be added in under 30 minutes. Eon Mode is enabling this customer to not only improve query performance, but the dynamic AWS service consumption resulted in dramatic cost savings as well.
Q10. What should we expect from Vertica in the future with respect to cloud and Eon Mode product development?
Ben Vandiver: We are working on expanding Eon Mode functionality in a variety of dimensions. By distributing work for a shard among a collection of nodes, Eon Mode can get more “crunch” from adding nodes, thus improving elasticity. Operationally, we are working on better support for sub-clusters, no-downtime upgrade, auto-scaling, and backup snapshots for operator error. As mentioned previously, deployment options like Azure cloud, Google cloud, HDFS, and other on-premises technologies are on our roadmap. Our initial 9.1 Eon Mode release is just the beginning. I’m excited at what the future holds for Vertica and the innovations we continue to bring to market in support of our customers.
I spent many years at MIT, picking up a bachelor’s, master’s, and PhD (my thesis was on Byzantine fault tolerance of databases). I have a passion for teaching, having spent several years teaching computer science.
From classes of 25 to 400, I enjoy finding clear ways to explain technical concepts, untangle student confusion, and have fun in the process. The database group at MIT, located down the hall from my office, developed Vertica’s founding C-Store paper.
I joined Vertica as a software engineer in August 2008. Over the years, I worked on many areas of the product including transactions, locking, WOS, backup/restore, distributed query, execution engine, resource pools, networking, administrative tooling, metadata management, and so on. If I can’t answer a technical question myself, I can usually point at the engineer who can. Several years ago I made the transition to management, running the Distributed Infrastructure, Execution Engine, and Security teams. I believe in an inclusive engineering culture where everyone shares knowledge and works on fun and interesting problems together – I sponsor our Hackathons, Crack-a-thon, Tech Talks, and WAR Rooms.
More recently, I’ve been running the Eon project, which aims to support a cloud-ready design for Vertica running on shared storage. While engineering is where I spend most of my time, I occasionally fly out to meet customers, notably a number of bigger ones in the Bay area. I was promoted to Vertica CTO in May 2017.
– For more information on Vertica in Eon Mode, read the technical paper: Eon Mode: Bringing the Vertica Columnar Database to the Cloud.
– To learn more about Vertica’s cloud capabilities visit www.vertica.com/clouds
– On RDBMS, NoSQL and NewSQL databases. Interview with John Ryan ODBMS Industry Watch, 2018-03-09
– On Vertica and the new combined Micro Focus company. Interview with Colin Mahony ODBMS Industry Watch, 2017-10-25
Follow us on Twitter: @odbmsorg
“Time series data” are sequential series of data about things that change over time, usually indexed by a timestamp. The world around us is full of examples. – Andrei Gorine
I have interviewed Andrei Gorine, Chief Technical Officer and McObject co-founder.
Main topics of the interview are: time series analysis, “in-chip” analytics, efficient Big Data processing, and the STAC M3 Kanaga Benchmark.
Q1. Who is using time series analysis?
Andrei Gorine: “Time series data” are sequential series of data about things that change over time, usually indexed by a timestamp. The world around us is full of examples — here are just a few:
• Self-driving cars continuously read data points from the surrounding environment — distances, speed limits, etc. These readings are often collected in the form of time-series data, analyzed and correlated with other measurements or onboard data (such as the current speed) to make the car turn to avoid obstacles, slow down or speed up, etc.
• Retail industry point-of-sale systems collect data on every transaction and communicate that data to a back-end where it gets analyzed in real-time, allowing or denying credit, dispatching goods, and extending subsequent relevant retail offers. Every time a credit card is used, the information is put into a time series data store where it is correlated with other related data through sophisticated market algorithms.
• Financial market trading algorithms continuously collect real-time data on changing markets, run algorithms to assess strategies and maximize the investor’s return (or minimize loss, for that matter).
• Web services and other web applications instantly register hundreds of millions of events every second, and form responses through analyzing time-series data sets.
• Industrial automation devices collect data from millions of sensors placed throughout all sorts of industrial settings — plants, equipment, machinery, environment. Controllers run analysis to monitor the “health” of production processes, making instant control decisions, sometimes preventing disasters, but more often simply ensuring uneventful production.
Q2. Why is a columnar data layout important for time series analysis?
Andrei Gorine: Time-series databases have some unique properties dictated by the nature of the data they store. One of them is the simple fact that time-series data can accumulate quickly, e.g. trading applications can add millions of trade-and-quote (“TAQ”) elements per second, and sensors based on high-resolution timers generate piles and piles of data. In addition, time-series data elements are normally received by, and written into, the database in timestamp order. Elements with sequential timestamps are arranged linearly, next to each other on the storage media. Furthermore, a typical query for time-series data is an analytical query or aggregation based on the data’s timestamp (e.g. calculate the simple moving average, or the volume-weighted average price of a stock over some period). In other words, data requests often must gain access to a massive number of elements (in real life, often millions of elements) of the same time series.
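To make the typical query pattern concrete, here is a minimal Python sketch of a streaming simple moving average over timestamp-ordered ticks. This is illustrative only — a real time-series engine computes such aggregates inside the database, not in application code:

```python
from collections import deque

def moving_average(ticks, window):
    """Yield (timestamp, average of the last `window` prices) for each tick."""
    buf = deque(maxlen=window)  # oldest price drops out automatically
    for ts, price in ticks:
        buf.append(price)
        yield ts, sum(buf) / len(buf)

# Ticks arrive in timestamp order, as described above.
ticks = [(1, 10.0), (2, 12.0), (3, 11.0), (4, 13.0)]
result = list(moving_average(ticks, window=2))
# result -> [(1, 10.0), (2, 11.0), (3, 11.5), (4, 12.0)]
```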
The performance of a database query is directly related to the number of I/O calls required to fulfill the request: fewer I/O calls contribute to greater performance. Columnar data layout allows for significantly smaller working-set data sizes – the hypothetical per-column overhead is negligible compared to per-row overhead. For example, given a conservative 20 bytes of per-row overhead, storing 4-byte measurements in the horizontal layout (1 time series entry per row) requires 6 times more space than in the columnar layout (e.g. each row consumes 24 bytes, whereas one additional element in a columnar layout requires just 4 bytes). Since less storage space is required to store time series data in the columnar layout, fewer I/O calls are required to fetch any given amount of real (non-overhead) data. Another space-saving feature of the columnar layout is content compression — the columnar layout allows for far more efficient and algorithmically simpler compression algorithms (such as run-length encoding over a column). Lastly, a row in a row-based layout contains many columns. In other words, when the database run-time reads a row of data (or, more commonly, a page of rows), it is reading many columns. When analytics require only one column (e.g. to calculate an aggregate of a time series over some window of time), it is far more efficient to read pages of just that column.
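The space arithmetic above, and the run-length encoding idea, can be checked in a few lines of Python. The 20-byte overhead is the text’s own conservative estimate, and the `rle_encode` helper is an invented illustration, not any product’s actual compressor:

```python
# Space arithmetic from the text: per-row overhead dominates small values.
ROW_OVERHEAD = 20          # bytes of per-row overhead (conservative estimate)
VALUE_SIZE = 4             # one 4-byte measurement

row_layout_bytes = ROW_OVERHEAD + VALUE_SIZE      # 24 bytes per element
columnar_bytes = VALUE_SIZE                       # 4 bytes per element
assert row_layout_bytes // columnar_bytes == 6    # the 6x factor quoted above

# Run-length encoding over a column, the simple compression mentioned above:
def rle_encode(column):
    """Compress a repetitive column into [value, run_length] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([value, 1])
    return runs

rle_encode([7, 7, 7, 9, 9])   # -> [[7, 3], [9, 2]]
```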
The sequential pattern in which time-series data is stored and retrieved in columnar databases (often referred to as “spatial locality”) leads to a higher probability of preserving the content of various cache subsystems, including all levels of CPU cache, while running a query. Placing relevant data closer to processing units is vitally important for performance: L1 cache access is 3 times faster than L2 cache access, 10 times faster than L3 unshared line access and 100 times faster than access to RAM (i7 Xeon). In the same vein, the ability to utilize various vector operations, and the ever growing set of SIMD instruction sets (Single Instruction Multiple Data) in particular, contribute to speedy aggregate calculations. Examples include SIMD vector instructions that operate on multiple values contained in one large register at the same time, SSE (Streaming SIMD Extensions) instruction sets on Intel and AltiVec instructions on PowerPC, pipelining (i.e. computations inside the CPU that are done in stages), and much more.
Q3. What is “on-chip” analytics and how is it different than any other data analytics?
Andrei Gorine: This is also referred to as “in-chip” analytics. The concept of pipelining has been successfully employed in computing for decades. Pipelining refers to a series of data-processing elements in which the output of one element is the input of the next. Instruction pipelines have been used in CPU designs since the introduction of RISC CPUs, and modern GPUs (graphics processors) pipeline various stages of common rendering operations. Elements of software pipeline optimizations are also found in operating system kernels.
Time-series data layouts are perfect candidates for a pipelining approach. Operations (functions) over time-series data (in our product, time series data are called “sequences”) are implemented through “iterators”. Iterators carry chunks of data relevant to the function’s execution (we call these chunks of data “tiles”). Sequence functions receive one or more input iterators, perform the required calculations and write the result to an output iterator. The output iterator, in turn, is passed into the next function in the pipeline as an input iterator, building a pipeline that moves data from the database storage through the set of operations up to the result set in memory. The “nodes” in this pipeline are operations, while the edges (“channels”) are iterators. The interim operation results are not materialized in memory. Instead, the “tiles” of elements are passed through the pipeline, where each tile is referenced by an iterator. The tile is the unit of data exchange between the operators in the pipeline. The tile size is small enough to keep the tile in the top-level L1 CPU cache and large enough to allow for efficient use of the superscalar and vector capabilities of modern CPUs. For example, 128 time-series elements fit into a 32K cache. Hence the term “on-chip” or “in-chip” analytics. As mentioned, top-level cache access is 3 times faster than level-two (L2) cache access.
To illustrate the approach, consider the operation x*y + z where x ,y and z are large sequences, or vectors if you will (perhaps megabytes or even gigabytes). If the complete interim result of the first operation (x*y) is created, then at the moment the last element of it is received, the first element of the interim sequence is already pushed out of the cache. The second operation (+ z) would have to load it from memory. Tile-based pipelining avoids this scenario.
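The tile-based pipeline for x*y + z can be sketched with Python generators. This is a toy model only — the tile size and operator names are invented for illustration and are not eXtremeDB’s actual API — but it shows the key property: the interim result of x*y is never materialized in full.

```python
TILE = 128  # elements per tile, sized (per the text) to stay resident in L1 cache

def tiles(seq):
    """Split a sequence into tiles (the 'input iterator')."""
    for i in range(0, len(seq), TILE):
        yield seq[i:i + TILE]

def mul(xs, ys):
    """Elementwise multiply, one tile at a time."""
    for tx, ty in zip(xs, ys):
        yield [a * b for a, b in zip(tx, ty)]

def add(xs, ys):
    """Elementwise add, one tile at a time."""
    for tx, ty in zip(xs, ys):
        yield [a + b for a, b in zip(tx, ty)]

x, y, z = [2.0] * 1000, [3.0] * 1000, [1.0] * 1000
# Operators are chained; only one tile of x*y exists at any moment,
# so it is still cache-resident when '+ z' consumes it.
pipeline = add(mul(tiles(x), tiles(y)), tiles(z))
result = [v for tile in pipeline for v in tile]
# every element is 2*3 + 1 = 7
```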
Q4. What are the main technical challenges you face when executing distributed query processing and ensuring at the same time high scalability and low latency when working with Big Data?
Andrei Gorine: Efficient Big Data processing almost always requires data partitioning. Distributing data over multiple physical nodes, or even partitions on the same node, and executing software algorithms in parallel allows for better hardware resource utilization through maximizing CPU load and exploiting storage media I/O concurrency. Software lookup and analytics algorithms take advantage of each node’s reduced data set through minimizing memory allocations required to run the queries, etc. However, distributed data processing comes loaded with many challenges. From the standpoint of the database management system, the challenges are two-fold: distributed query optimization and data distribution.
First is optimizing distributed query execution plans so that each instance of the query running on a local node is tuned to minimize the I/O, CPU, buffer space and communications cost. Complex queries lead to complex execution plans. Complex plans require efficient distribution of queries through collecting, sharing and analyzing statistics in the distributed setup. Analyzing statistics is not a trivial task even locally, but in the distributed system environment the complexity of the task is an order of magnitude higher.
Another issue that requires a lot of attention in the distributed setting is runtime partition pruning. Partition pruning is an essential performance feature. In a nutshell, in order to avoid compiling queries every time, the queries are prepared (for example, “select ..where x=10” is replaced with “select ..where x=?”). The problem is that in the un-prepared form, the SQL compiler is capable of figuring out that the query is best executed on some known node. Yet in the second, prepared form, that “best” node is not known to the compiler. Thus, the choices are either sending the query to every node, or locating the node with the given key value during the execution stage.
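The second choice — locating the owning node at execution time, once the parameter is bound — can be sketched in Python. This is a hypothetical illustration with invented node names and simple hash partitioning, not McObject’s actual routing logic:

```python
NODES = ["node1", "node2", "node3"]

def node_for_key(value) -> str:
    """Locate the node owning a sharding-key value (simple hash partitioning)."""
    return NODES[hash(value) % len(NODES)]

def execute_prepared(query_template, bound_value):
    """At execute time the bound value is known, so the query can be routed
    to the single owning node instead of being broadcast to every node."""
    target = node_for_key(bound_value)
    # In a real system the query would now be sent only to `target`.
    return target, query_template.replace("?", repr(bound_value))

target, sql = execute_prepared("select * from t where x = ?", 10)
# Only one node receives the query, rather than all three.
```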
Even when the SQL execution plan is optimized for distributed processing, the efficiency of distributed algorithms heavily depends on the data distribution. Thus, the second challenge is often to figure out the data distribution algorithm so that a given set of queries are optimized.
Data distribution is especially important for JOIN queries — this is perhaps one of the greatest challenges for distributed SQL developers. In order to build a truly scalable distributed join, the best policy is to have records from all involved tables with the same key values located on the same node. In this scenario all joins are in fact local. But, in practice, this distribution is rare. A popular JOIN technique is to replicate the smaller “dimension” tables on all nodes while sharding the large “fact” tables. However, building dimension tables requires special attention from application developers. The ultimate solution to the distributed JOIN problem is to implement the “shuffle join” algorithm. An efficient shuffle join is, however, very difficult to put together.
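The shuffle-join idea can be sketched in a few lines of Python. This is illustrative only — a real implementation shuffles partitions over the network and streams them — but it shows the core move: re-partition both tables by join key so every join becomes local.

```python
from collections import defaultdict

NUM_NODES = 3

def shuffle(rows, key_index):
    """Re-partition rows across nodes by hashing the join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_index]) % NUM_NODES].append(row)
    return partitions

def local_hash_join(left, right):
    """Join two co-located partitions on their first column."""
    index = defaultdict(list)
    for row in right:
        index[row[0]].append(row)
    return [l + r[1:] for l in left for r in index[l[0]]]

# Hypothetical tables: orders joined to customers on customer id.
orders = [(1, "book"), (2, "pen"), (1, "ink")]
customers = [(1, "alice"), (2, "bob")]

left_parts = shuffle(orders, 0)
right_parts = shuffle(customers, 0)
joined = []
for node in range(NUM_NODES):      # each node joins only its own partition
    joined += local_hash_join(left_parts[node], right_parts[node])
# joined pairs each order with its customer name, with no cross-node lookups
```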
Q5. What is the STAC M3 Kanaga Benchmark and what is it useful for?
Andrei Gorine: STAC M3 Kanaga simulates financial applications’ patterns over large sets of data. The data is represented via simplified historical randomized datasets reflecting ten years of trade and quote (TAQ) data. The entire dataset is about 30 terabytes in size. The Kanaga test suite consists of a number of benchmarks aimed to compare different aspects of its “System Under Test” (SUT), but mostly to highlight performance benefits of the hardware and DBMS software utilized. The Kanaga benchmark specification was written by practitioners from global banks and trading firms to mimic real-life patterns of tick analysis. Our implementations of the STAC-M3 benchmark aim to fully utilize the underlying physical storage I/O channels, and maximize CPU load by dividing the benchmark’s large dataset into a number of smaller parts called “shards”. Based on the available hardware resources, i.e. the number of CPU cores and physical servers, I/O channels, and sometimes network bandwidth, the number of shards can vary from dozens to hundreds. Each shard’s data is then processed by the database system in parallel, usually using dedicated CPU cores and media channels, and the results of that processing (calculated averages, etc.) are combined into a single result set by our distributed database management system.
The Kanaga test suite includes a number of benchmarks symptomatic of financial markets application patterns:
• An I/O-bound HIBID benchmark that calculates the high bid offer value over a period of time — one year for Kanaga. The database management system optimizes processing through parallelizing time-series processing and extensive use of single instruction, multiple data (SIMD) instructions, yet the total IOPS (Inputs/Outputs per Second) that the physical storage is capable of is an important factor in achieving better results.
• The “market snapshot” benchmark stresses the SUT — the database and the underlying hardware storage media — requiring them to perform well under a high-load parallel workload that simulates real-world financial applications’ multi-user data access patterns. In this test, the (A) ability to execute columnar-storage operations in parallel, (B) efficient indexing and (C) low storage I/O latency play important roles in getting better results.
• The volume-weighted average bid (VWAB) benchmarks compute averages over a one-day period. On the software side, the VWAB benchmarks benefit from the use of columnar storage and the analytics function pipelining discussed above to maximize CPU cache utilization and CPU bandwidth and reduce main memory requirements. Hardware-wise, I/O bandwidth and latency play a notable role.
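For readers unfamiliar with the metric, a volume-weighted average is a simple aggregation. The sketch below shows the calculation itself, not the benchmark’s implementation:

```python
def vwap(trades):
    """Volume-weighted average price over (price, volume) pairs
    for one symbol and one day."""
    total_value = sum(price * volume for price, volume in trades)
    total_volume = sum(volume for _, volume in trades)
    return total_value / total_volume

# Two trades: 100 shares at 10.0 and 300 shares at 11.0.
vwap([(10.0, 100), (11.0, 300)])   # -> 10.75
```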
Andrei Gorine, Chief Technical Officer, McObject.
McObject co-founder Andrei leads the company’s product engineering. As CTO, he has driven growth of the eXtremeDB real-time embedded database system, from the product’s conception to its current wide usage in virtually all embedded systems market segments. Mr. Gorine’s strong background includes senior positions with leading embedded systems and database software companies; his experience in providing embedded storage solutions in such fields as industrial control, industrial preventative maintenance, satellite and cable television, and telecommunications equipment is highly recognized in the industry. Mr. Gorine has published articles and spoken at many conferences on topics including real-time database systems, high availability, and memory management. Over the course of his career he has participated in both academic and industry research projects in the area of real-time database systems. Mr. Gorine holds a Master’s degree in Computer Science from the Moscow Institute of Electronic Machinery and is a member of IEEE and ACM.