"Trends and Information on AI, Big Data, Data Science, New Data Management Technologies, and Innovation."

This is the ODBMS.org Industry Watch blog.

May 5 21

On Amazon DocumentDB. Interview with Barry Morris

by Roberto V. Zicari

“We built DocumentDB to implement the Apache 2.0 open source MongoDB APIs, specifically by emulating the responses that a MongoDB client expects from a MongoDB server. We don’t support 100 percent of the APIs today, but we do support the vast majority that customers actually use. We continue to work back from customers and support additional APIs that customers ask for.” — Barry Morris.

I have interviewed Barry Morris, GM of ElastiCache, Timestream and DocumentDB at AWS. We talked about DocumentDB.

RVZ.

Q1. AWS has many database services now. Why DocumentDB? Why did you build it?

Barry Morris: At AWS we believe customers should choose the right tool for the right job, and we don’t believe there’s a one size fits all database given the variety and scale of applications out there. Customers using our purpose-built databases don’t have to compromise on the functionality, performance, or scale of their workloads because they have a tool that is expressly designed for the purpose at hand. In the case of Amazon DocumentDB (with MongoDB compatibility) we offer a fast, scalable, highly available, and fully managed document database service that is purpose-built to store and query JSON.

We built Amazon DocumentDB because customers kept asking us for a flexible database service that could scale document workloads with ease. Amazon DocumentDB has made it simple for these customers to store, query, and index data in the same flexible JSON format that is generated in their applications, so it is highly intuitive for their developers. And it achieves this expressive document query support while also maintaining the high availability, performance, and durability required for modern enterprise applications in the cloud. Similar to our other AWS purpose-built database services, Amazon DocumentDB is fully managed, so customers can scale their databases with clicks in the console rather than executing a planning exercise that takes weeks.

Finally, because many of our customers with document database needs are already enthusiastic about and familiar with the MongoDB APIs, we designed Amazon DocumentDB to implement the Apache 2.0 open source MongoDB APIs. This allows customers to use their existing MongoDB drivers and tools with Amazon DocumentDB, and to migrate directly from their self-managed MongoDB databases to Amazon DocumentDB. It also gives them the freedom to migrate data in and out of DocumentDB without fear of lock-in.

Q2. Who is using DocumentDB and for what?

Barry Morris: Amazon DocumentDB is being used today by a wide variety of customers, from longstanding global enterprises like Samsung and Capital One, to digital natives like Rappi and Zulily, to financial organizations like FINRA. In addition, several products that Amazon customers use, such as the Fulfillment by Amazon (FBA) experience on Amazon.com, are powered by Amazon DocumentDB. We have customers in virtually every industry, from financial services to retail, from gaming to manufacturing, from media and entertainment to publishing, and more.

Many of our customers are software engineering teams who don’t want to deal with the “undifferentiated heavy lifting” of database administration, such as hardware provisioning, patching, setup, and configuration. These organizations would rather allocate their valuable engineering talent to building core application functionality than to deploying and managing MongoDB clusters. One of our customers, Plume, saved themselves the cost of “three to five approximately $150,000 Silicon Valley salaries”, which both offset the managed service cost and allowed their team to focus on their core mission to deliver a superior wireless internet experience. Further, DocumentDB allows Plume to scale much further than their previous solution, with one of their clouds handling as many as 50,000 API calls per minute. The full case study is available on our website.

The customer use cases are wide and many, given that document databases offer both flexible schemas and extensive query capabilities. Some of the more traditional use cases for document databases include catalogs, user profiles, and content management systems; and with the scale that AWS and Amazon DocumentDB provide, we are seeing customers deploy document databases for a much wider range of internet-scale use cases, including critical customer-facing e-commerce applications and production telemetry.

Q3. What has been the customer response?

Barry Morris: As with all AWS services, we work very closely with DocumentDB customers to ensure we are building a service that works backward from their needs. To date, the feedback we get is that customers are thrilled by DocumentDB’s ease of scaling, its fully managed capabilities, its natural integration with other AWS offerings, its durability and general enterprise-readiness, and its straightforward API compatibility with MongoDB. Of course, we are always working to add capabilities and features that are highly requested. For example, we just improved our MongoDB compatibility by adding support for frequently requested APIs such as renameCollection, $natural, and $indexOfArray. In the coming months, we also plan to release one of our most-requested features, Global Clusters, for customers with cross-region disaster recovery and data locality requirements. We also continue to bolster our MongoDB compatibility by adding support for the APIs that customers use the most.

Q4. What are the main design features of Amazon DocumentDB?

Barry Morris: Amazon DocumentDB has been built from the ground up with a cloud native architecture designed for scaling JSON workloads with ease. An essential design feature of DocumentDB is that it decouples compute and storage, allowing each to scale independently. Because storage and compute are separate, customers can add replicas without putting additional load on the primary. This allows you to easily scale out read capacity to millions of requests per second by adding up to 15 low latency read replicas across three AWS Availability Zones (AZs) in minutes. DocumentDB’s distributed, fault-tolerant, self-healing storage system auto-scales storage up to 64 TB per database cluster without the need for sharding, and without any impact or downtime to a customer’s application.

As I mentioned before, DocumentDB is built to be enterprise-ready. It provides strict network isolation with Amazon Virtual Private Cloud (VPC). All data is encrypted at rest with AWS Key Management Service (KMS) and encryption in transit is provided with Transport Layer Security (TLS). DocumentDB has compliance readiness with a wide range of industry standards, and automatically and continuously monitors and backs up to Amazon S3, which is highly durable.

Q5. When would you suggest to use DocumentDB vs another purpose-built database?

Barry Morris: At its core, DocumentDB is designed to store, index, and query rich and complex JSON documents with high availability and scalability. You can retrieve documents based on nested field values, join data across collections, and perform aggregation queries. So if you need schema flexibility and the ability to index and query rich structured and semi-structured documents, DocumentDB is a great choice. This is particularly true if you have JSON document workloads that are mission critical for your organization. A DocumentDB cluster provides 99.99% availability, can handle tens of thousands of writes per second and millions of reads per second, and supports up to 64 TiB of data. Finally, since DocumentDB supports MongoDB workloads and is compatible with the MongoDB API, it is a logical choice for MongoDB users who are looking to easily migrate to a fully managed database solution. Every use case is unique, and it is often a good idea to engage an AWS solution architect (SA) if you have questions about selecting the right database for your next application.
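To make "retrieve documents based on nested field values" and "aggregation queries" concrete, here is a minimal sketch in plain Python. The filter and pipeline documents use standard MongoDB query syntax (dot notation, `$group`, `$sum`); the sample data and the small in-memory matcher are illustrative assumptions, not part of DocumentDB or any driver.

```python
from collections import Counter
from typing import Any


def get_path(doc: dict, dotted: str) -> Any:
    """Follow a dotted path such as "address.city" through nested dicts."""
    current: Any = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(doc: dict, query: dict) -> bool:
    """Equality-only subset of a MongoDB-style filter (illustration only)."""
    return all(get_path(doc, path) == expected for path, expected in query.items())


# Hypothetical user-profile documents with a nested "address" field.
profiles = [
    {"_id": 1, "name": "Ana", "address": {"city": "Bogota", "country": "CO"}},
    {"_id": 2, "name": "Ben", "address": {"city": "Boston", "country": "US"}},
    {"_id": 3, "name": "Cai", "address": {"city": "Boston", "country": "US"}},
]

# The same filter document a driver would send to the server.
nested_filter = {"address.city": "Boston"}
boston_ids = [p["_id"] for p in profiles if matches(p, nested_filter)]
print(boston_ids)  # [2, 3]

# Server-side, a count per city would be expressed as an aggregation pipeline:
pipeline = [{"$group": {"_id": "$address.city", "count": {"$sum": 1}}}]
# Local equivalent of that pipeline, for illustration:
per_city = Counter(get_path(p, "address.city") for p in profiles)
print(per_city["Boston"])  # 2
```

In a real application the filter and pipeline would be passed to a MongoDB driver rather than evaluated locally.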

Q6. What are the key advantages of DocumentDB vs managing your own cluster?

Barry Morris: For many customers, fully managed is all about scale. We scale your database at the click of a button, saving you nights and weekends of scaling clusters manually. Customers don’t have to worry about provisioning hardware, running the service, configuring for high availability, or dealing with patching and durability. These concerns are shifted to AWS, so our customers can focus on their applications and innovate on behalf of their customers. Something as simple as backup and restore can be a drag on production. With DocumentDB, backup is on by default.

Cost is also a big concern when managing your own clusters. This can include the cost of labor resources, hardware investments, vendor software solutions, support costs, and more. Cost becomes very transparent with DocumentDB, as it offers pay-as-you-go pricing with per second instance billing. You don’t have to worry about planning for future growth, because DocumentDB scales with your business.

Q7. Tell me about “MongoDB compatibility” – what does that really mean in practice?

Barry Morris: That’s a great question and one we get a lot from customers. We built DocumentDB to implement the Apache 2.0 open source MongoDB APIs, specifically by emulating the responses that a MongoDB client expects from a MongoDB server. We don’t support 100 percent of the APIs today, but we do support the vast majority that customers actually use. We continue to work back from customers and support additional APIs that customers ask for. Because we offer MongoDB API compatibility, it’s straightforward to migrate from the MongoDB databases you’re managing on premises or in EC2 today to DocumentDB. Updating the application is as easy as changing the database endpoint to the new Amazon DocumentDB cluster.
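As a rough illustration of "changing the database endpoint", the sketch below rewrites only the host and port of a MongoDB connection URI and keeps credentials, database name, and options intact. Both host names, the credentials, and the helper `swap_endpoint` are hypothetical; a real migration may also need cluster-specific options (for example TLS settings) adjusted.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_endpoint(uri: str, new_host: str, new_port: int = 27017) -> str:
    """Return `uri` with only host and port replaced (hypothetical helper)."""
    parts = urlsplit(uri)
    # Keep "user:password@" if present; drop the old "host:port".
    userinfo, sep, _old_hostport = parts.netloc.rpartition("@")
    netloc = f"{userinfo}{sep}{new_host}:{new_port}"
    return urlunsplit(parts._replace(netloc=netloc))


# Hypothetical self-managed MongoDB URI and hypothetical cluster endpoint.
old_uri = "mongodb://app_user:secret@mongo.internal.example:27017/catalog?replicaSet=rs0"
new_uri = swap_endpoint(old_uri, "my-cluster.cluster-abc123.us-east-1.docdb.amazonaws.com")
print(new_uri)
```

The application code that issues queries through the driver stays unchanged; only the configured URI differs.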

Q8. Let’s hear about some exciting customer momentum. Can you please share some customer stories?

Barry Morris: We have a lot of them! Customers including BBC, Capital One, Dow Jones, FINRA, Samsung, and The Washington Post have shared their success stories with us. Recently, we’ve done some deeper-dive case studies with customers in a range of industries.

For example, Zulily presented their solution at AWS re:Invent 2020. The popular online retailer is using Amazon DocumentDB along with Amazon Kinesis Data Analytics to power its “suggested searches” feature. In this solution, Kinesis Data Analytics filters relevant events from clickstream analytics when a Zulily customer requests a search, a Lambda function performs a lookup for brands and categories relevant to those events, and the resulting enriched events — which populate the suggested search — are stored in DocumentDB. The feature has been a hit, with more than 75% of Zulily customers using suggested searches when they search the online store.

A customer story that is particularly compelling given recent events is Rappi. Rappi is a successful Colombian delivery app startup that operates in nine Latin American countries. The company had been rearchitecting their monolithic application into a more flexible, microservices-driven architecture to help it scale as it grew. As part of this modernization effort, the startup selected DocumentDB as a fully managed, purpose-built JSON database service to replace its self-managed MongoDB clusters, which were becoming unwieldy to manage at scale. When Covid-19 hit, the company faced an unprecedented surge in orders and deliveries. DocumentDB enabled them to handle the surge because, as a highly scalable service, it operated as normal despite the change in volume. Overall, Rappi decreased management and operational overhead by more than 50% using Amazon DocumentDB.

A final one I will mention is Asahi Shimbun, which is one of Japan’s oldest and largest-circulated newspapers. The company overhauled its digital app last year using AWS and selected Amazon DocumentDB as their content master database to store their articles. Since modernizing, Asahi Shimbun has seen a 30% reduction in monthly operation costs for extracting past articles and a 20% improvement in frequency of use for the app. This is one of many examples that showcase how essential AWS is for industries like publishing, retail, and banking that are evolving with new business models in the cloud.

You can peruse these and many other customer case studies in full on our website.

Q9. Anything else you wish to add?

Barry Morris: Over the last decade, JSON/document-based workloads have become one of the primary alternatives to relational approaches, for a wide range of applications with requirements for flexible data management. We expect this trend to keep growing, particularly with cloud-native applications, and we’re excited to offer DocumentDB as a tool in the toolkit of modern builders leveraging JSON. It’s been great to see DocumentDB support the needs not only of customers who are migrating their existing MongoDB workloads to the cloud, but also the builders who are creating modern applications and choosing DocumentDB as the right “purpose-built database” for their needs.

For anyone interested in learning more and getting hands-on with DocumentDB, we have a number of things coming up that may be of interest. We will be hosting two DocumentDB Focus Days, which are virtual workshops on best practices, in May and June. You can learn more and sign up on the registration page.  Finally, we have an ongoing Twitch series where our solution architects (SAs) dive deeper on DocumentDB functionality, which you can learn more about on the website. Our DocumentDB product detail page is the best place to start for a general overview of the service and steps to get started, and you can refer to the documentation for an in-depth developer guide.

………………………….


Barry Morris, GM ElastiCache, Timestream and DocumentDB. As General Manager of ElastiCache, Timestream and DocumentDB, Barry manages a number of businesses in the AWS database portfolio.  He is focused on delivering value to AWS customers through trusted data management services, with a relentless commitment to database innovation.

Prior to joining AWS in 2020, his career includes over 20 years as the CEO of international technology companies, both private and public, including Undo.io, NuoDB, StreamBase, Headway, and IONA Technologies. Barry has also had leadership roles in PROTEK, Metrica, Lotus Development and DEC. 

Born in South Africa, Barry lived in England and Ireland before moving to Boston. He holds a Bachelor’s Degree (BA) in engineering from Oxford University and an Honorary Doctorate in Business Administration (DBA) from the IMCA.

Resources

– Get Started with Amazon DocumentDB

Related Posts

– From SQL to NoSQL. Interview with Carlos Fernández, by Roberto V. Zicari, ODBMS Industry Watch, April 30, 2021

Follow us on Twitter: @odbmsorg

Apr 30 21

From SQL to NoSQL. Interview with Carlos Fernández

by Roberto V. Zicari

“We like to say that we have the biggest database on companies and sole proprietors in Spain. We handle 7 million national economic agents, and the database undergoes more than 150,000 daily information updates. We have been active since 1992, so our historic file is massive. The database as a whole exceeds 40 Terabytes.” –Carlos Fernández

I have interviewed Carlos Fernández, Deputy General Manager at INFORMA Dun & Bradstreet. We talked about their use of the LeanXcale database.

RVZ

Q1. Could you describe in a few words what Informa Dun & Bradstreet is and what its figures are?

Carlos Fernández: Informa D&B is the leading business information services company for customer and supplier acquisitions, analyses and management. We maintain this leadership in the three markets in which we compete: Spain, Portugal and Colombia.

We like to say that we have the biggest database on companies and sole proprietors in Spain. We handle 7 million national economic agents, and the database undergoes more than 150,000 daily information updates. We have been active since 1992, so our historic file is massive. The database as a whole exceeds 40 Terabytes.

To maintain and update this massive database, we invest 12 million euros every year in data and data handling procedures and systems, and we have 130 data specialists that take care of every single piece of information that we load into the database. Data quality, accuracy and timeliness as well as the coherence between different sources are essential for us.

Q2. I understand that Informa D&B has begun a profound update of its data architecture in order to continue being a market leader for another 10 years. What does the update consist of?

Carlos Fernández: We really began updating when gigabytes were insufficient for our needs. Now we see that terabytes will follow the same path. Petabytes are the future, and we need to be prepared for it. We usually say that when you need to travel to another continent, you need an airplane, not a car.

What does this mean in practical terms? Our customers are used to online responses to their needs. However, these needs have become more complex and require greater data depth.

If you are able to store hundreds of terabytes, use them very quickly and use complex analytic models to easily find the answer to your question, then you are in good shape.

To fulfill these requirements, a Data Lake orientation is really a must, and solutions like LeanXcale will become key factors in our new architectural approach.

Q3. You mentioned that you have found a new database manager, LeanXcale, to address the challenges for your data platform. What kind of database manager were you using before and why are you replacing it?

Carlos Fernández: INFORMA was, and still is, an “Oracle” company. Having said that, the more we began to move into a Data Lake design, the more new solutions and new names came into play. Mongo, Cassandra, Spark …

So, having come from an SQL-oriented environment featuring many lines of code, we wondered if we could fulfill our new requirements with the old technology. The answer to that query is a clear NO. Can we rewrite INFORMA as a whole? The answer is again NO. Can we meet our new requirements by increasing our computing capacity? Once more, the answer is NO.

We needed to be smart and find a solution that could bring positive outcomes in an affordable technical environment.

Q4. According to you, one of the main improvements has been the acceleration of the process through leveraging the interfaces of LeanXcale with NoSQL and SQL. Can you elaborate on how it helped you?

Carlos Fernández: As I mentioned before, we have quite challenging business and product performance requirements. On the other hand, business rules are also complex and difficult to rewrite for different environments.

Can we solve our issues without a huge investment in expensive servers? Can we also accommodate these requirements in a scalable fashion?

LeanXcale and its NoSQL and SQL interfaces were the perfect match for our needs.

Q5. What are the technical and business benefits of having a linear scaling database such as LeanXcale?

Carlos Fernández: We have many customers. They range from the biggest Spanish companies to small businesses and sole proprietors. They have completely different needs, but, at the same time, they share many requirements, with the main one being immediate response time.

Of course, the amount of data and model complexity involved in generating a response can vary quite a lot, depending on the size of the company and its portfolio.

Only by being able to accommodate such demands with a scalable solution can we provide the required services under a valid cost structure.

Q6. How was your experience with LeanXcale as a provider?

Carlos Fernández: For us, this has been quite an experience. From the very beginning, the LeanXcale team acted as though they worked for INFORMA.

We started with a POC, and it was not an easy one. We had the feeling that we had the best parts of the company involved in the project. Well, not really the feeling since that really was the case.

The key factor, however, was the team’s knowledge, that is, the depth of their technical approach, the extent to which they understood our needs and their ability to reshape many aspects to make our requirements a reality.

Q7. You said that LeanXcale has a high impact on reducing total cost of ownership. Could you provide us with figures comparing it to the previous scenario?

Carlos Fernández: LeanXcale has reduced our processing time by a factor of more than 72. The standard LeanXcale licensing and support price means savings of around 85%. In our case, we have maximized these savings by signing an unlimited License Agreement for the next five years.

Additionally, this improved performance reduces the infrastructure used in our hybrid cloud by the same factor of 72.

However, these savings are less crucial than the operational risk reduction and the enablement of new services. Being ready to react to any unexpected event quickly makes our business more reliable. New services will allow us to maintain our market leadership for the next decade.

Q8. How will this new technology affect the services offered to the customer?

Carlos Fernández: I think that we can consider two periods of time in the answer.

Right now, we are able to improve the features of our current product range. We can deliver updated external databases faster and more frequently and offer a better customer experience in many areas. We can provide more data and more complex solutions to a wider range of customers.

For the future, we are discovering new ways to design new products and services. When you break down barriers, new ideas come up quite easily. Our marketing team is really excited about the new capabilities we will have. I am sure that we will shortly see many new things coming from us.

Q9. Anything else you wish to add?

Carlos Fernández: INFORMA D&B is a company that has put innovation at the top of its strategy. We never stop and will find new opportunities by using LeanXcale. We are very pleased and very sure that we will be a market leader for many years to come!

——————————————


Carlos Fernández holds a Superior Degree in Physics and an MBA from the “Instituto de Empresa” in Madrid. His professional career has included stints at companies such as Saint Gobain, Indra, Reuters and Fedea.

At the present time, he is Deputy General Manager at INFORMA and a member of the board of the XBRL Spanish Jurisdiction. In addition, he is a member of the Alcobendas City Council’s Open Data Advisory Board. This entity is firmly committed to continue advancing and publishing information in a reusable format to generate social and economic value.

Furthermore, he is a former member of various boards, including the boards of ASNEF Logalty, ASFAC Logalty and CTI.

He is a former member of GREFIS (Financial Information Services Group of Experts) and a current member of XBRL CRAS (Credit Risk Services), for which he is Vice President of the Technical Working Group. He is also a former member of the Information Technologies Advisory Council (CATI) and the AMETIC Association (Multi-Sector Partnership of Electronics, Communications Technology, Telecommunications and Digital Content Companies).

Resources

YouTube: LeanXcale’s success story on Informa D&B by Carlos Fernández Iñigo, CTO at Informa D&B

Related Posts

On Digital Transformation, Big Data, Advanced Analytics, AI for the Financial Sector. Interview with Kerem Tomak, by Roberto V. Zicari, ODBMS Industry Watch. July 8, 2019


Apr 21 21

On C++ Debugging. Interview with Greg Law

by Roberto V. Zicari

“Like it or not, debugging is part of programming. There is a lot of research and cool technology about preventing bugs (programming language features or design decisions that make certain bugs impossible) or catching bugs very early (through static or dynamic analysis or better testing), and all this is of course laudable and good stuff. But I’ve often been struck by how little attention is placed on making it easier to fix those bugs when they inevitably do happen.” — Greg Law

Q1: You are a prolific speaker at C++ conferences and podcasts. In your experience, who is still using C++?

Greg Law: C++ is used widely and its use is growing. I see a lot of C++ usage in Data Management, Networking, Electronic Design Automation (EDA), Aerospace, Games, Finance, etc.

It’s probably true that use of some other languages – particularly JavaScript and Python – is growing even faster, but those languages are weak where C++ is strong and vice versa. Go is growing a lot and Rust is getting a lot of attention right now and has some very attractive properties. 10-15 years ago, it felt almost like programming languages were “done” but these days, we’re seeing a lot of innovation both in terms of new or newish languages, and development of older languages. Even plain old C is seeing a bit of a resurgence. We are going to continue living in a multi-language world; I expect C++ to remain an important language for a long while yet.

Q2: In my interview with Bjarne Stroustrup last year, he spoke about the challenge of designing C++ in the face of contradictory demands of making the language simpler, whilst adding new functionality and without breaking people’s code. What are your thoughts on this?

Greg Law: I totally agree. I think all engineering is about two things – minimising mistakes and making tradeoffs (i.e. judgements). Mistakes might be a miscalculation when designing a bridge so that it won’t stand up or an off-by-one error in your program – those are clearly undesirable, we don’t want those. A tradeoff might be between how expensive the bridge is to build and how long it will last, or how long the code takes to write and how fast it runs.

But tradeoffs are relevant when it comes to reducing errors too – what price should we pay to avoid errors in our programs? How much extra time are we prepared to spend writing or testing it to get the bugs out? How far do we go tracking down those flaky 1-in-a-thousand failures in the test-suite? Are we going to sacrifice runtime performance by writing it in a higher-level and less error-prone language? Alternatively, we could choose to make that super-clever optimisation about which it’s hard to be confident it is correct today and even harder to be sure it will remain correct as the code around it changes; but is the runtime performance gain worth it, given the uncertainty that has been introduced? It’s counterintuitive, but actually there is an optimal bugginess for any program – we inevitably trade off cost of implementation and performance against potential bugs.

It’s probably fair to say however that most programs have more bugs than is optimal! I think it’s also true that human nature means we tend to under-invest in dealing with the bugs early, particularly flaky tests. We always feel “this week is particularly busy, I’ll park that and take a look next week when I’ll have a bit more time”; and of course next week turns out to be just as bad as this week.

Q3: I understand Undo helps software engineering teams with debugging complex C/C++ code bases. What is the situation with debugging C/C++? What are you seeing on the ground?

Greg Law: Like it or not, debugging is part of programming. There is a lot of research and cool technology about preventing bugs (programming language features or design decisions that make certain bugs impossible) or catching bugs very early (through static or dynamic analysis or better testing), and all this is of course laudable and good stuff. But I’ve often been struck by how little attention is placed on making it easier to fix those bugs when they inevitably do happen. The situation is not unlike medicine in that prevention is better than cure, and the earlier the diagnosis the better; but no matter what we do, we will always need cure (unlike medicine we have the balance wrong the other way round – in medicine we spend way too much on cure vs prevention!).

It’s all about tradeoffs again. All else being equal, we’d ensure there are no bugs in the first place; but all else never is equal, and how high a price can we afford on prevention? And actually if you make diagnosis and fixing cheaper, that further reduces how much you need to spend on prevention.

The harsh reality is that close to none of the software out there today is truly understood by anyone. Humans just aren’t very good at writing code, and economic pressure and other factors mean we add and fix tests until our fear of delivering late outweighs our fear of bugs. This is compounded as code ages; people move on from the project, bugs get fixed by adding a quick hack, further increasing the spaghettification. Like frogs in boiling water, we’ve kind of become so used to it that we don’t notice how awful it is any more!

People routinely just disable flaky failing tests because they can’t root-cause them. Over a third of production failures can be traced back directly or indirectly to a test that was failing and was ignored.

Q4: You have designed a time travel debugger for C/C++. What is it for?

Greg Law: Debugging is really answering one question: “what happened?”. I had certain expectations for what my code was going to do and all I know is that reality diverged from those expectations. Traditional debuggers are of limited help here – they don’t tell you what happened, they just tell you what is happening right now. You hit a breakpoint, you can look around and see what state everything is in, and either it looks all good or you can see something wrong. If it’s good, set another breakpoint and continue. If it’s bad… well, now you want to know what happened, how it became bad. The odds of breaking just at the right point and stepping your code through the badness are pretty long. So you run again, and again, if you’re lucky vaguely the same thing happens each time so you can home in on it; if not, well… you’re in trouble.

With a time travel debugger like UDB, it’s totally different – you see some piece of state is bad, you can just go backwards to find out why. Watchpoints (aka data breakpoints) are super powerful here – you can watch the bad piece of data and run backwards and have the debugger take you straight to the line of code that last modified it. We have customers who have been trying to fix something for literally years who with a couple of watch + reverse-continue operations had it nailed in an hour.
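The "watch + reverse-continue" idea can be modelled in a few lines of Python. This is a toy sketch, not how UDB works internally: the `Recording` class, its method names, and the variable names are all invented for illustration. It logs every write during a run, then searches the log backwards from the point where the symptom was noticed to find the write that last modified the watched variable.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Write:
    step: int    # position in the recorded execution history
    var: str     # name of the variable that was written
    where: str   # label for the code location that did the write
    value: Any   # value that was stored


class Recording:
    """Toy execution recording: an append-only log of writes."""

    def __init__(self) -> None:
        self.log = []

    def write(self, var: str, value: Any, where: str) -> None:
        self.log.append(Write(len(self.log), var, where, value))

    def reverse_continue(self, var: str, from_step: int) -> Optional[Write]:
        """Most recent write to `var` strictly before `from_step`, if any."""
        for entry in reversed(self.log[:from_step]):
            if entry.var == var:
                return entry
        return None


rec = Recording()
rec.write("balance", 100, "open_account")
rec.write("balance", 150, "deposit")
rec.write("balance", -50, "apply_discount")  # the bug: balance goes negative
rec.write("audit_count", 1, "audit")
rec.write("audit_count", 2, "audit")

# The symptom is noticed only now, several steps after the bad write.
culprit = rec.reverse_continue("balance", len(rec.log))
print(culprit.where)  # apply_discount

# Going back once more shows the last good value and who wrote it.
previous = rec.reverse_continue("balance", culprit.step)
print(previous.where, previous.value)  # deposit 150
```

The point of the model is that the search starts from the symptom and moves backwards, so there is no need to guess where to place a breakpoint and rerun.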

Time travel debuggers are really powerful for any bug where a decent amount of time passes between the bug itself and the symptoms (assertion failure, segmentation fault, bad results produced). They are particularly useful when there is any kind of non-determinism in the program – when the bug only occurs one time in a thousand and/or every time you run the program it fails at a different point or in a different way. Most race conditions are examples of this; so are many memory or state corruption bugs. It can also help to diagnose complex memory leaks. Most leak detectors or static analysis help with the trivial issues (say you returned an error and forgot to add a free) but not the hard ones (for example when you have a reference counting bug and so the reference never hits zero and the resources don’t get cleaned up).
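A reference counting leak of the kind described here can be sketched with a hand-rolled counter. The `RefCounted` class and `process` function below are hypothetical and only model manual reference counting as found in C or C++ code; they do not rely on Python's own garbage collector.

```python
class RefCounted:
    """Hand-rolled reference count; cleanup runs only when it reaches zero."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.refs = 1          # the creator holds the first reference
        self.released = False  # stand-in for "resources were freed"

    def acquire(self) -> "RefCounted":
        self.refs += 1
        return self

    def release(self) -> None:
        self.refs -= 1
        if self.refs == 0:
            self.released = True


def process(buf: RefCounted, fail: bool) -> bool:
    buf.acquire()
    if fail:
        return False  # BUG: the early return skips the matching release()
    buf.release()
    return True


ok = RefCounted("ok")
process(ok, fail=False)
ok.release()  # creator drops its reference; count reaches zero

leaky = RefCounted("leaky")
process(leaky, fail=True)
leaky.release()  # creator drops its reference, but one is still outstanding

print(ok.released, leaky.released)  # True False
print(leaky.refs)                   # 1
```

Nothing crashes and no call to free is visibly missing at the point where the object is dropped, which is why this kind of leak tends to show up long after the faulty code path ran.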

This new white paper provides more insight into what kind of bugs time travel debugging helps with *. It’s not uncommon for software engineers to spend half their time debugging, so it’s a must-read for anyone who wants to increase development team productivity.

By the way, Time Travel Debugging is also sometimes known as Replay Debugging or Reverse Debugging.

Q5: Since you say it lets you see what happened, could it help with code exploration too?

Greg Law: Funny you say that. This is a use case it wasn’t initially designed for, but many engineers are using it to explore unfamiliar codebases they didn’t write. They use it to observe program behaviour by navigating forwards and backwards in the program’s execution history, examine registers to find the address of an object etc. They say there’s a huge productivity benefit in being able to go backwards and forwards over the same section of code until you fully understand what it does. Especially as you’re trying to understand a certain piece of code, and there are often millions of lines you don’t care about right now, it’s easy to get lost. When that happens you can go straight back to where you were and continue exploring.

Debugging is about answering “what did the code do” (ref. cpp.chat podcast on setting a breakpoint in the past **); but there are other activities that involve asking that same question. As I say, most code out there is not really understood by anyone.  

Q6: What are your tips on how to diagnose and debug complex C++ programs?

Greg Law: The hard part about debugging is figuring out the root cause. Usually, once you’ve identified what’s wrong, the fix is quite simple. We once had a bug that sank literally months of engineering time to root cause, and the fix was a single character – that’s extreme but the effect it’s illustrating is very common.

Identifying the problem is an exercise in figuring out what the code really did as opposed to what you expected. Somewhere reality has diverged from your expectations – and that point of divergence is your bug. If you’re lucky, the effects manifest soon after the bug – maybe a NULL pointer is dereferenced and you needed a check for NULL right before it. But more often that pointer should never be NULL; the problem is earlier.

The answer to this is multi-pronged:

1. Liberal use of assertions to find problems as close to their root cause as possible. I reckon that 50% of assert fails are just bogus assertions, which is annoying but cheap to fix because the problem is at the very line of code that you notice. The other 50% will save you a lot of time.

2. If you see something not right, do not sweep it under the carpet. This is sometimes referred to as ‘smelling smoke’. Maybe it’s nothing, but you better go and look and see if there’s a fire. When you’re smelling smoke, you’re getting close to the root cause. If you ignore it, chances are that whatever the underlying cause of the weirdness is, it will come back and bite you in a way that gives you much less of a clue as to what’s wrong, and it’ll take you a lot longer to fix it. Likewise don’t paper over the cracks – if you don’t understand how that pointer can be NULL, don’t just put a check for NULL at the point the segv happened.

This most often manifests itself in people ignoring flaky test failures. 82% of software companies report having failing tests that were not investigated that went on to cause production failures *** (the other 18% are probably lying!). Working in this way requires discipline – following that smell of smoke or fixing that flaky test that you know isn’t your fault will be a distraction from your proximate goal. But when something is not right, or not understood, ignoring it now is going to cost you a lot of time in the long run.

3. Provide a way to know what your code is really doing. The trendy term is observability. This can be good old printf or some more fancy logging. An emerging technique is Software Failure Replay, which is related to Time-Travel Debugging. Here you record the program execution (a failed process), such that a debugger can be pointed at the execution history and you can go back to any line of code that executed and see full program state. This is like the ultimate observability. Discovering where reality diverged from your expectations becomes trivial.
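The first tip above, asserting expectations close to where they must hold, can be sketched in C++ as follows. The `Node` type and the `sum_first_n` function are hypothetical, made up for this illustration, and the sketch assumes a debug build in which `assert` is active (it is compiled out when `NDEBUG` is defined).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical singly linked list node, invented for this sketch.
struct Node {
    int value;
    Node* next;
};

// Sums the first n values. The caller promises the list has at least n nodes.
int sum_first_n(const Node* head, std::size_t n) {
    int total = 0;
    const Node* cur = head;
    for (std::size_t i = 0; i < n; ++i) {
        // State the expectation where it must hold. If the list is shorter
        // than promised, the failure is reported here with a readable message
        // instead of surfacing as a segfault on the next line. A silent
        // `if (!cur) break;` would hide the caller's bug instead.
        assert(cur != nullptr && "list is shorter than the caller claimed");
        total += cur->value;
        cur = cur->next;
    }
    return total;
}
```

For a three-node list holding 1, 2 and 3, `sum_first_n(head, 3)` returns 6. Asking for four nodes trips the assertion at the line where the expectation is broken, which is the point of the tip: the failure is noticed next to its cause rather than somewhere downstream.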

————————————-

Dr Greg Law is the founder of Undo, the leading Software Failure Replay platform provider. Greg has 20 years’ experience in the software industry prior to founding Undo and has held development and management roles at companies, including Solarflare and the pioneering British computer firm Acorn. Greg holds a PhD from City University, London, and is a regular speaker at CppCon, ACCU, QCon, and DBTest.

Resources

* White Paper: Increase Development Productivity with Time Travel Debugging

** cpp.chat podcast – Setting a Breakpoint in the Past

*** Freeform Dynamics Analyst Report – Optimizing the software supplier and customer relationship

Related Posts

Thirty Years C++. Interview with Bjarne Stroustrup. by Roberto V. Zicari. ODBMS Industry Watch. July 23, 2020

Follow us on Twitter: @odbmsorg

 

Apr 7 21

On the new Tortoise Global AI Index. Interview with Alexandra Mousavizadeh.

by Roberto V. Zicari

“I think the conversation is really about a shift of where funding is coming from. Governments are spending far less than the big tech platforms. What this tells us about who owns the direction of travel of AI is fascinating. Are we now in a position where the power that the public sector was able to deploy in the past is massively outgunned by private companies and their R&D budgets?” — Alexandra Mousavizadeh.

This is my follow-up interview with Alexandra Mousavizadeh, Partner at Tortoise Media. We talked about the new version of the Global AI Index.

RVZ

Q1. In December 2020, Tortoise Media launched the Global AI Index to benchmark nations on their level of investment, innovation and implementation of artificial intelligence. Since then, your team has been working on expanding the Index. What are the main results of this new Index?

Alexandra Mousavizadeh: The most striking result is China’s rapid improvement. Although the gap between the top three is significant (the US is still streets ahead of China, and the UK lags far behind in third place) across our 143 indicators, China has made gains. That rise is mostly due to a serious boost in research: yet another Chinese university joined the Times’ list of top 100 computer science universities; the total number of citations from high-achieving Chinese computer science academics jumped by 67 per cent over the course of the year; Chinese academic AI papers accepted by the IEEE – a body which sets AI standards and also publishes a number of influential AI journals – now outnumber those by US academics by a factor of seven; and China overtook the US in terms of AI patents granted around two years ago and has been pulling further ahead ever since.

China is also pulling ahead in the roll-out of supercomputers, with almost twice as many supercomputers as the US, demonstrating its growing threat to the US’ AI supremacy.

The UK has slipped in some key metrics, and its lead over its closest competitors has narrowed. Although a new AI strategy has now been announced, it’s not been published yet. We have also seen British slippage across several different key parts of the framework: universities, supercomputing, research, patents and diversity of specialists.

Q2. What has been added to the Global AI Index of this year?

Alexandra Mousavizadeh: We’ve added Armenia, Bahrain, Chile, Colombia, Greece, Slovakia, Slovenia and Vietnam: each of these countries has recently published a strategic approach to artificial intelligence on a national level, and therefore ‘qualified’ for assessment under The Global AI Index framework. That brings the number of countries assessed to 62 overall, up from 54 last year.

We’ve also developed all-new national AI dashboards: policy makers can monitor their national AI activity across all 143 indicators in real-time. These dashboards, the first of their kind, can also simulate the impact of policies via the target-setting feature, which calculates hypothetical ranks and scores based on chosen policy targets.

Q3. What new metrics did you introduce and why?

Alexandra Mousavizadeh: We’ve added a range of new metrics to deepen our measurements across many of the pillars of the index. In Talent, we have incorporated data provided by Coursera, showing the level of enrollment and activity on online learning courses specific to ‘artificial intelligence’ and ‘machine learning’. This data-set fills in gaps in countries like India, China and Russia – where our other metrics were not as comprehensive – by acting as a proxy for the level of online learning taking place.

We’ve also refined some of our existing metrics to increase the accuracy of our data. We’ve replaced the Open Data Barometer data-set, which was outdated in many respects, with measures from the OECD OURdata Index to better reflect the level of open data use and suitability. The new source is much more recent and relevant. We’ve also made our measures of 5G implementation more granular, reflecting the actual level of supported networks in a given country. This is a crucial leading indicator for the capacity to adopt artificial intelligence more widely in a given country,  so this change offers a lot more clarity in the Infrastructure pillar of the index.

In our forthcoming update for the index in May, we’ll be adding more indicators on diversity issues. These will complement a component of the index that deals with regulation and ethics.

Q4. Who are the biggest risers on the Index?

Alexandra Mousavizadeh: Israel has surged up the rankings from 12th to 5th place. It improved its talent rank, with more R package downloads, an increase in Stack Overflow questions and answers, and a rise in GitHub commits. These are important indicators because they speak to developments in a country’s coding community beyond the formal education sector. It also still has the highest number of startups as a proportion of the population – 3 AI startups for every 100,000 people (compared with 5 startups for every million people in the US). It’s an impressive feat for a small country, but its rise is driven in part by the number of proportional, or intensity-based, metrics in the index (which favour small countries), and partly by changes in our methodology that more accurately capture the number of developers and other specialists on social media.

Also of note is Finland, which has procured a pre-exascale supercomputer and substantially increased its coding activity. Use of Python, R and commits on GitHub have grown, as well as our measures of GitHub Stars and Stack Overflow Questions/Answers. This is a notable result of Finland’s ongoing focus on skills development, driven by both government strategy and an excellent ecosystem, including the Finnish AI Accelerator, the Tampere AI Hub and the AI Academy at University of Turku. That complements exciting tech startups, including Rovio, Supercell, and CRF Health.

The Netherlands is the biggest riser: helped by a slow-down in some of the countries that previously ranked above it (Japan), the Netherlands has accelerated its coding and development activities and has high scores on the proportional Talent measures, i.e. number of data scientists per capita.

Q5. How has the pandemic influenced the global development of AI in the world? Do you have any insights to share on this?

Alexandra Mousavizadeh: It is difficult to untangle the exact effect of COVID on the global development of AI – a lot of the data is yet to come in. One area that has definitely suffered is start-ups. This time last year, the UK had 529 AI startups listed on Crunchbase, but it’s now dropped to 338. Other countries have seen similar collapses in startup numbers. There are some counterintuitive results: the 62 countries in our Index attracted similar overall levels of private funding for AI as in 2019. Several countries have seen a fall in investment, but these don’t necessarily match those that have been worst affected by the pandemic. Both the US and China, for instance, have seen a similar drop-off in funding, despite the US having a much worse pandemic overall.

The pandemic has also created challenges to which some countries have responded with AI. Coronavirus has become a focal point for Israel’s AI entrepreneurs, with several Israeli companies emerging as front-runners in areas like diagnostics, disease management and monitoring systems. Vocalis Health, for example, was launched by the defence ministry and is aiming to diagnose Covid effectively based on people’s speaking patterns.

Q6. The lack of transparency is a limiting factor for the effectiveness of the Index. For example in the case of Russia, much of its AI spending may be going to military purposes – and you can’t track it.  What is your take on this?

Alexandra Mousavizadeh: It’s true that there is AI spending and research that we can’t track, especially in countries that are less transparent, like Russia. We only use one proprietary data source for the Index – the Crunchbase API –  and the vast majority of the rest of our information is open source. Government spending on clandestine AI activity represents a small proportion of funding and progress in the AI space. Even if a country is spending quite large sums opaquely on covert AI, that spending is siloed, and often won’t contribute to a country’s overall progress in AI.

Q7. Isn’t this also the case for most of the other countries such as China, but even the USA? They do not necessarily reveal their use of AI for military purposes.

Alexandra Mousavizadeh: I think the conversation is really about a shift of where funding is coming from. Governments are spending far less than the big tech platforms. What this tells us about who owns the direction of travel of AI is fascinating. Are we now in a position where the power that the public sector was able to deploy in the past is massively outgunned by private companies and their R&D budgets? Amazon spends roughly ten times as much on R&D and infrastructure as DARPA’s total budget ($36bn and $3.5bn in 2019 respectively). The UK’s new research agency, ARIA, is set to have just £800m over the course of the next parliament.

Another related issue is that of selective publication. It’s well within a company’s rights to not release research that it carries out – but when that research has the potential to create massive public goods, it’s concerning that decisions are made in private by unaccountable tech companies.

Q9. How would you characterize the current geo-politics of artificial intelligence?

Alexandra Mousavizadeh: The countries that get on top of AI will accrue enormous benefits very quickly – not just in efficiency gains and cost reductions, but in transformative technologies that will increase almost every aspect of their global competitiveness. Although a lot of research that we’re currently seeing is business, not state-led, states can move rapidly if they have to.

Conventional wisdom says it’s a competitive environment, with some saying that we’re seeing a rise in AI nationalism. But the different gains and losses made by nations in different factors and metrics on the index show a complex picture of states investing in different areas. The World Economic Forum’s ‘Framework for Developing a National Artificial Intelligence Strategy’ highlights the collaborative role that many governments are aiming to play; co-designing, rather than merely responding to, technological change across multiple sectors. And many of the factors we track don’t stick to one country – talent crosses borders, and GitHub is global. A lot of the most interesting developments are iterated with open-source technology. So in many ways, the geo-politics of AI can be one of mutual benefit rather than a zero-sum environment.

It’s more complicated than any one story; but one narrative that we’re definitely seeing is two superpowers, the US and China, establishing themselves as dominant in the space. Then there are a number of specialist smaller states that could have a significant role to play in terms of standard-setting, and specialist AI in areas of national comparative advantage – take the UK and medtech for example. Finally, there are countries that want to be involved but aren’t currently in a place where their strategy (or level of investment) matches their ambitions. Those countries need to get serious, quickly.

AI is an accelerant – we run the risk of seeing clusters of AI excellence that exacerbate divides within and between nations, compounding existing inequalities and leaving those without skills and capital behind.

Q10. Carissa Veliz mentioned that “to ensure ethical behaviour around AI, certain behaviours should be banned by law”. Do you plan to include policy regulations in a future version of the Index?

Alexandra Mousavizadeh: The dashboards we developed have a set of implicit policy recommendations sitting behind the indicators; governments can see how their rankings would improve by using these features. But with regard to ethics, we spent hundreds of hours with the team discussing how regulation and ethics could feed into the index and concluded that it needed an in-depth examination which warranted its own investigation. We are now doing that work. The May update to the index will contain a section on policy regulation, so stay tuned.

Having said this, there are some metrics in the Index which do address ethical considerations – such as the diversity of researchers in STEM subjects. This section will also be expanded in May.

Qx. Anything else you wish to add?

Alexandra Mousavizadeh: We love hearing from our readers at Tortoise; the point is that we include them as part of the conversation. Our AI Sensemaker newsletter contains our and their thinking, keeping you posted on the latest developments in AI, and comes out once a fortnight. You can sign up here.  To get more involved, you can also apply to join our AI Network: it’s a global community of experts, policy makers, and business leaders, who take part in monthly round tables. These cutting-edge conversations set the pace for all things AI.

————————-

Alexandra Mousavizadeh is a Partner at Tortoise Media, running the Intelligence team which develops indices and data analytics. She is the creator of the recently released Responsibility100 Index and the new Global AI Index. She has 20 years’ experience in the ratings and index business and has worked extensively across the Middle East and Africa. Previously, she directed the expansion of the Legatum Institute’s flagship publication, The Prosperity Index, and all its bespoke metrics-based analysis & policy design for governments. Prior roles include CEO of ARC Ratings, a global emerging markets based ratings agency; Sovereign Analyst for Moody’s covering Africa; and head of Country Risk Management, EMEA, Morgan Stanley.

Resources

The Global AI Index

The Global AI Index Methodology Report

The Tortoise Global AI Summit: the Readout. Thursday 10 December 2020

Related Posts

On The Global AI Index. Interview with Alexandra Mousavizadeh. ODBMS Industry Watch. by Roberto V. Zicari on January 18, 2020

Mar 18 21

On the Challenges Facing Financial Institutions. Interview with Joe Lichtenberg

by Roberto V. Zicari

“There are three factors C-suite executives need to consider when addressing operational resilience: the need to make better business decisions faster; improved automation and the elimination of manual processes; and the ability to respond to unexpected volume and valuation volatility.” –Joe Lichtenberg.

I have interviewed Joe Lichtenberg, responsible for product and industry marketing for data platform software at InterSystems.

RVZ

Q1. What are the main challenges financial institutions are facing right now?

Joe Lichtenberg: As financial services organizations are pushed to rapidly adapt due to the pandemic, they also want to gain a competitive edge, deliver more value to customers, reduce risk, and respond more quickly to the needs of distributed businesses. To not only stand out, but ultimately survive, financial services organizations have relied on their digital capabilities. For instance, many have adapted faster than anticipated and found ways to supplement traditional face-to-face customer service. As the volume of complex data grows and the need to use data for decision-making accelerates, it is becoming more difficult to reach their business goals and deliver differentiated service to customers at a faster rate.

Q2. What do you suggest to C-suite executives that could help them re-evaluate their operational resilience in light of increasing volumes and volatility, and the shift to a remote working environment (especially due to the COVID-19 crisis)?

Joe Lichtenberg: There are three factors C-suite executives need to consider when addressing operational resilience: the need to make better business decisions faster; improved automation and the elimination of manual processes; and the ability to respond to unexpected volume and valuation volatility. Executives need to prioritize their organizations’ ability to access and process a single representation of accurate, consistent, real-time and trusted data. The volatility and uncertainty fueled by the pandemic pushed organizations to rely on the vast amounts of data available to them to properly bolster resilience and adaptability.

From scenario planning to modeling enterprise risk and liquidity, regulatory compliance, and wealth management, access to accurate and current data can enable organizations to make smarter business decisions faster. Organizations need to streamline and accelerate operations by eliminating manual processes where possible and automating processes. Not only will this help increase speed and agility, but it will also reduce the delays and errors associated with manual processes. Finally, executives must look to ensure they have sufficient headroom, processing capabilities, and systems in place to foster agility and reliability and to respond to unexpected volatility.

Q3. What are the key challenges they face to keep pace with the ongoing market dynamics?

Joe Lichtenberg: The more data sources organizations have, the more complex their practices become. As data grows, so does the prevalence of data silos, making access to a single, trusted, and usable representation of the data challenging. Additionally, analytics are more difficult to perform with disorganized data, causing results to be less accurate, especially with regard to visibility, decision support, risk, compliance, and reporting. This issue is extremely important as organizations perform advanced analytics (e.g. machine learning), where access to large sets of clean, healthy data is required in order to build models that deliver accurate results.

Q4. Do you believe that capital markets are paying the price for delaying core investments into their data architectures?

Joe Lichtenberg: Established capital markets organizations have delayed some of their investments in data architectures for a variety of reasons, and that move has ultimately kept operational costs in check, as large changes could drastically disrupt workflows and set them back further in the short term. These organizations typically have well-established core infrastructures in place that have served them well. Over the years, they have been expanded on, which means introducing significant changes is a complicated process. However, the combination of unprecedented volatility, rising customer expectations, and competition from niche financial technology companies that are providing new services is straining the limits of these systems and pushing established firms to modernize faster, using microservices, APIs, and AI. In fact, in some cases financial organizations are outsourcing non-core capabilities to FinTechs. The FinTechs, not burdened by legacy infrastructure, are able to innovate quickly but may not have the breadth, depth, or resilience of the established firms. As capital markets firms modernize their data architecture, replacing these systems can lead to greater downtime that can slow and stall modernization efforts. Implementing a data fabric enables organizations to modernize without costly rip-and-replace methods and empowers them to address siloed legacy applications while existing systems remain in place.

Q5. What is a “data fabric”?

Joe Lichtenberg: A data fabric is a reference architecture that provides the capabilities required to discover, connect, integrate, transform, analyze, manage, and utilize enterprise data assets. It enables the business to meet its myriad of business goals faster and with less complexity than legacy technologies. It connects disparate data and applications, including on-premises, from partners, and in the public cloud. An enterprise data fabric combines several data management technologies, including database management, data integration, data transformation, pipelining, API management, etc.

A data fabric addresses many of the limitations of data warehouses and data lakes and, according to Gartner, brings on a new wave of redesign to the modern data architecture to create a more dynamic system. A smart data fabric extends the capabilities to include a wide array of analytics capabilities – eliminating the complexity and delays associated with traditional approaches like data lakes that require moving data to yet another environment.

Q6. How can firms eliminate the friction that has been built up around accessing information, and reduce the cost and complexity of data wrangling?

Joe Lichtenberg: Building a smart enterprise data fabric as a data layer to serve the organization enables firms to reduce complexity, speed development, accelerate time to value, simplify maintenance and operations, and lower total cost of ownership. Additionally, it enables organizations to execute analytics and programmatic actions on demand, by utilizing clean and current data that resides within the organization.

Q7. What are your recommendations for firms that need to work with competitive insights?

Joe Lichtenberg: Competitive insights require access to accurate and current data that may reside in different silos in order to get maximum value. A data fabric provides the necessary access to this required data, but a smart data fabric takes this a step further. It incorporates a wide range of analytics capabilities, including data exploration, business intelligence, natural language processing, and machine learning, that enable organizations to visualize, drill into and explore, and combine the data from different sources. This helps not just skilled developers, data stewards and analysts, but a wide range of users that are close to the business to gain new insights to guide business decisions and create intelligent prescriptive services and applications.

Q8. Where do you see Artificial intelligence and machine learning technologies playing a role for financial institutions?

Joe Lichtenberg: Advanced analytics are essential to the future success of financial institutions. AI, ML, and natural language processing (NLP) tools are already being utilized in various areas of financial services. Although some may argue that niche FinTechs lead the way in the adoption of these tools, established organizations are also utilizing AI and machine learning to increase wallet share, enhance customer engagement, and guide strategic decisions. However, these tools are only as effective as the data that powers them. Without healthy data, they can’t deliver accurate results. That is why it’s essential to place an emphasis on the quality of data that is collected and fed into these powerful tools.

Q9. The next generation of technology advancement must be built on strong data foundations. Artificial intelligence and machine learning require a high volume of current, clean, normalised data from across the relevant silos of a business to function. How is it possible to deliver this data without requiring an entire structural rebuild of every enterprise data store?

Joe Lichtenberg: This is a common initiative for which a smart enterprise data fabric is being used. But implementing such a reference architecture can be complex, requiring implementing and integrating many different data management technologies. Modern data platforms that combine multiple layers and capabilities in a single product, reducing complexity by minimizing the number of products and technologies required, are helping to deliver critical business benefits with a simpler architecture, faster time to value, and lower total cost of ownership. For example, modern data platforms combine horizontally scalable, transactional and analytic database management capabilities, data and application integration functionality, data persistence, API management, analytics, machine learning, and business intelligence in a single product built from the ground up on a common architecture. Not only can the implementation of a smart enterprise data fabric with a modern data platform at the core help firms address current pain points, it accelerates the move toward a digital future without the costly rip-and-replace of their current operational infrastructure.

———————

Joe Lichtenberg is responsible for product and industry marketing for data platform software at InterSystems. Joe has decades of experience working with various data management, analytics, and cloud computing technology providers.

Resources

Five Key Reasons to Invest in a Smart Data Fabric. InterSystems.

– Accelerate Your Enterprise Data Initiatives with a Smart Data Fabric. InterSystems. (PDF download; registration required)

Related Posts

– On AI for Insurance and Risk Management. Interview with Sastry Durvasula. by Roberto V. Zicari on February 13, 2020

Nov 23 20

On Digital Transformation and Ethics. Interview with Eberhard Schnebel

by Roberto V. Zicari

“Whether an organization is transparent and fair is, first of all, a very emotional category. This is something that customers and strategic partners decide by their own judgment and feelings.
The emotional communication of these values is the central task.
However, there must also be minimum standards. It must be possible to check whether digital processes are fair and transparent. An inspection can be carried out for these minimum standards either as certification or as an onsite inspection. This depends on the complexity of the systems.” — Eberhard Schnebel.

I have interviewed Eberhard Schnebel, Group Risk Management at Commerzbank AG, where he led the project “Data Ethics / Digital Ethics” to establish aspects of corporate ethics regarding digitalization, Big Data and AI in the financial industry. We talked about Digital Transformation and Ethics.

Stay safe!

RVZ

Q1. What is Digital Ethics?

Eberhard Schnebel: Digital Ethics is the communication of applying ethical or social ideas to digital technology and making them a reality with digital technology. In the past, this was done in the context of Technology Assessment. In that approach, the social and technological aspects are strictly separated, and digital technology’s instrumental use is systematically assessed from an ethical perspective.
Today’s approach sees digital development as an integrated part of our social life. It is important to transfer our social ideas into this digital reality and understand which digital contexts we would like to change.
This new approach understands Digital Ethics as part of the digital transformation itself. Digital Ethics thus directly influences our life with digital technology.

Q2. Why should Data Ethics and Digital Ethics become a core element of a company’s digital strategy?

Eberhard Schnebel: Data, Big Data, and digital technologies are changing the core elements of many companies’ business models. New communication and risk management tasks are emerging. These must be taken into account in many areas and professionally integrated into the organization.
Data Ethics and Digital Ethics create exactly this extension of organizational routines, which will contribute to the company’s success in the future. They create awareness of these elements’ new reality within the company – from marketing and customer communication to product design and data scientists.

Q3. In Europe, there are many legal precautions and increasing rules on how to adopt the issues of digital transformation and data ethics, e.g., EU-GDPR, EU-Trustworthy-AI, the recommendations of the Data Ethics Commission of the German government. What does this mean for an organization?

Eberhard Schnebel: Dealing with these developments and regulations is very important, given the rapid development of data technology and the rapid new possibilities for building up business models. There are always many grey areas where technology needs completely new solutions or where regulations have become blurred. Therefore, organizations must go far beyond a pure compliance system and consider integrating the requirements into a system as soon as they design it. In addition to pure “legal compliance”, organizations need an ethical policy that defines how employees must deal with fuzzy requirements. Such a policy helps to avoid various risks that would result from software development.
This is about creating the organizational conditions for living these ideas, and about increasing awareness of and readiness for social concerns in concrete terms.

Q4. How is it possible for companies to establish Data Ethics and Digital Ethics as an essential factor in their business model?

Eberhard Schnebel: Data Ethics and Digital Ethics define what needs to be done and what does not. They also define how products are set up, designed, and used. This is important to be able to sell everything to customers later in a way that satisfies them, which is the case when everyone – customers, employees, and management – has the same understanding of the company’s digital products, services, and tools.
Besides, the company must also make central organizational preparations, such as setting up a governance board or creating a framework for managing ethical risks.

Q5. What are the key challenges to achieving this in practice?

Eberhard Schnebel: Everyone who comes into contact with data or AI products should understand the framework of regulations applicable to a company’s data and AI ethics. We create a culture in which a data and AI ethics strategy can be successfully implemented and maintained. Therefore, it is necessary to educate and train employees, enabling them to raise important issues at key points and bring the main concerns to the appropriate advisory body.

Q6. What does “transparency” in the application practice mean for an organization?

Eberhard Schnebel: In an organization or company, “transparency” is very much about communication and less about the actual facts. Transparency is anything where others believe that they can sufficiently understand and comprehend your reasons and background. The creation of an emotional connection is the central point here. At the same time, it must be clear to the Governance Board and product designers and data scientists what it means to make this transparency factual.

Q7. What does “fairness” mean in the daily routine for an organization?

Eberhard Schnebel: Fairness is the most challenging term in Digital Ethics. Because transparency makes the tensions between the individual elements of digitization ever more visible, this must be offset by conveying fairness.
But fairness is also when new information asymmetries between companies and customers are used so that they always lead to an advantage for the customer. This advantage must, of course, be experienced by them exactly as such, which in turn requires very emotional communication. The tension between transparency and fairness, in turn, is a real ethical debate (see Nuffield Report).

Q8. Who will review whether an organization is transparent and fair?

Eberhard Schnebel: Whether an organization is transparent and fair is, first of all, a very emotional category. This is something that customers and strategic partners decide by their own judgment and feelings. The emotional communication of these values is the central task.
However, there must also be minimum standards. It must be possible to check whether digital processes are fair and transparent. An inspection can be carried out for these minimum standards either as certification or as an onsite inspection. This depends on the complexity of the systems.

Q9. Hardly any other technology is as cross-product and cross-industry as Artificial Intelligence (AI). Is it possible to achieve responsible use of AI?

Eberhard Schnebel: Yes, I think so, but this can only happen from within and not as a “regulation”. Just as the ideal of the “honorable merchant” is supposed to ensure responsible conduct in the economy, an inner understanding of digitization can also emerge.
To achieve this, we must ensure that this technology’s initial fascination, which still embraces us, is transformed. It must give way to a more sober view of what can be expected and what it will be used for. We must find quality criteria that describe its “goodness”.
But we also need visible and comprehensible analyses of the systems, including discussing the ethical tasks involved. Here, a new system of ethical screening can provide the insights needed for societal evaluation and discussion.

Q10. In your new book (*), you talk about “Management by Digital Excellence Ethics”. What do you mean by this?

Eberhard Schnebel: In the end, three building blocks must come together in Digital Ethics:
1. Digital Ethics adjusted compliance system that ensures that minimum standards are met, and risks are avoided.
2. Analytical module that ensures the necessary transparency and flow of ethical information for system design.
3. Commitment of management and product designers to communicate this ethical content to customers and business partners.
If the link can be made between the quality use of digital technologies, their appropriate meaning, and the communication of their benefits, this is Digital Excellence Ethics, because it successfully translates the excellent use of digital technology to integrate social ideas into business models.

Q11. Do you have anything to add?

Eberhard Schnebel: Again, with this book we want to encourage the connection between ethical instruments and ethical risk management to be communicative and not just organizational. Those involved must be emotionally involved. If we want to find Digital Ethics “on the engine” in the end, then only as an inner ethical element, as mindfulness.

———————————————————–

Dr. habil. Eberhard Schnebel
Ethical Advisor. Frankfurt Big Data Lab, Goethe University Frankfurt, Germany.

Eberhard Schnebel is a philosopher, economist and theologian. He received his PhD in 1995 from the Faculty of Theology of the LMU in Munich and his habilitation in 2013 at the Faculty of Philosophy of the LMU. His work combines Theory of Action, System-Theory and Ethics as foundations for responsibility in business and management.
Eberhard Schnebel has been teaching Business Ethics at Goethe University in Frankfurt since 2013. Since 2016 he has been developing “Digital Ethics” to integrate ethical communication structures into design processes of digitization and Artificial Intelligence.

He is a member of the Executive Committee of the “European Business Ethics Network” (EBEN). This network is engaged in the establishment of a common European understanding of ethics as a prerequisite for a converging European economy. He is also a member of the research team on Z-Inspection®, working on assessing Trustworthy AI in practice.

Eberhard Schnebel works full-time at Commerzbank AG, Group Risk Management, where he led the project “Data Ethics / Digital Ethics” to establish aspects of corporate ethics regarding digitalization, Big Data and AI in the financial industry. Prior to that, he led the project “Business Ethics and Finance Ethics” to introduce ethics as a tool for increasing management efficiency and accountability.

Resources
(*) A Valuable Future. Digital Ethics. by Dr. Eberhard Schnebel and Thomas Szabo. digital excellence, July 2020

Ethical Implications of AI, Series of Lectures. Open to all, no fees. Videos, Slides, Reports/Papers classified by topics.

Z-Inspection®: A process to assess Trustworthy AI.

Related Posts

– On Digital Transformation, Big Data, Advanced Analytics, AI for the Financial Sector. Interview with Kerem Tomak. ODBMS Industry Watch, July 8, 2019.

Follow us on Twitter: odbmsorg

Nov 6 20

Quantum Computer Systems. Interview with Fred Chong and Yongshan Ding

by Roberto V. Zicari

“Quantum computing is incredibly exciting because it is the only technology that we know of that could fundamentally change what is practically computable, and this could soon change the foundations of chemistry, materials science, biology, medicine, and agriculture.” — Fred Chong.

I have interviewed Fred Chong, Seymour Goodman Professor in the Department of Computer Science at the University of Chicago, and Yongshan Ding, PhD candidate in the Department of Computer Science at the University of Chicago, advised by Fred Chong. They just published an interesting book on Quantum Computer Systems.

RVZ

Q1. What is quantum computing?

Ding: Quantum computing is a model of computation that exploits the unusual behavior of quantum mechanical systems (e.g., particles at minimal energy and distance scales) for storing and manipulating information. Such quantum systems have significant computational potential, as they allow a quantum computer to operate on an exponentially large computational space, offering efficient solutions to problems that seem to be intractable in the classical computing paradigm.

Q2. What are the potential benefits of this new paradigm of computing?

Chong:  Quantum computing is incredibly exciting because it is the only technology that we know of that could fundamentally change what is practically computable, and this could soon change the foundations of chemistry, materials science, biology, medicine, and agriculture.

Quantum computing is the only technology in which every device that we add to a machine doubles the potential computing power of the machine. If we can overcome the challenges in developing practical algorithms, software, and machines, quantum computing could solve some problems where computation grows too quickly (exponentially in the size of the input) for classical machines.

In the short term, quantum computing will change our understanding of the aforementioned sciences that fundamentally rely on understanding the behavior of electrons. A classical computer uses an exponential number of bits (electrons) to model the positions of electrons and how they change. Obviously, nature only uses one electron to “model” each electron in a molecule. Quantum computers will use only a small (constant) number of electrons to model molecules.
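To make the doubling concrete, here is a rough sketch of what it costs a classical machine to store the full state of an n-qubit register. It assumes the simplest simulation strategy (keeping every one of the 2^n complex amplitudes in memory) and 16 bytes per amplitude, i.e. two 64-bit floats; both are illustrative assumptions, and the function name is made up for this example.

```python
def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Bytes needed to hold all 2**n complex amplitudes of an n-qubit state.

    Assumes a naive full state-vector simulation with one complex number
    (two 64-bit floats = 16 bytes) per basis state.
    """
    return (2 ** num_qubits) * bytes_per_amplitude


# Each extra qubit doubles the memory a classical simulator needs.
for n in (10, 30, 50):
    print(f"{n} qubits -> {statevector_bytes(n)} bytes")
```

Under these assumptions, 30 qubits already need 16 GiB and 50 qubits need 16 PiB, which is why this style of simulation stops being practical quickly.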

Q3. What is the current status of research in quantum computing?

Ding: It is no doubt an exciting time for quantum computing. Research institutions and technology companies worldwide are racing toward practical-scale, fully programmable quantum computers. Many others, although not building prototypes by themselves, are joining the force by investing in the field of quantum computing. Just last year, Google used their Sycamore prototype to demonstrate a first “quantum supremacy” experiment, performing a 200-second computation that would otherwise take days on a classical supercomputer[1],[2]. We have entered, according to John Preskill, the long-time leader in QC, a “noisy intermediate-scale quantum” (NISQ) technology era, in which non-error-corrected systems are used to implement quantum simulations and algorithms. In our book, we discuss several of the recent advances in NISQ algorithm implementations, software/hardware interface, and qubit technologies, and highlight what roles computer scientists and engineers can play to enable practical-scale quantum computing.

Q4. What are the key principles of quantum theory that have direct implications for quantum computing?

Chong: Paraphrasing our research teammate, Peter Shor, quantum computing derives its advantage from the combination of three properties: an exponentially-large state space, superposition, and interference. Each quantum bit you add to a system increases the state space of the machine by 2X, resulting in exponential growth with the number of qubits. These states can exist simultaneously in superposition, and manipulating these states creates interference in these states, resulting in patterns that can be used to solve problems such as the factoring of large numbers with Shor’s algorithm.
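Superposition and interference can be illustrated with a very small classical simulation of a single qubit. The sketch below is only an illustration under simple assumptions: the state is a list of two real amplitudes, the Hadamard gate is written as a plain 2x2 matrix, and the helper names are invented for this example. It is not how a real quantum computer is programmed.

```python
import math

# Hadamard gate as a 2x2 matrix.
S = 1 / math.sqrt(2)
HADAMARD = [[S, S],
            [S, -S]]


def apply(gate, state):
    """Multiply a 2x2 gate by a 2-element amplitude vector."""
    return [sum(gate[row][col] * state[col] for col in range(2))
            for row in range(2)]


def probabilities(state):
    """Measurement probabilities are the squared magnitudes of the amplitudes."""
    return [abs(amplitude) ** 2 for amplitude in state]


zero = [1.0, 0.0]                    # the basis state |0>
superposed = apply(HADAMARD, zero)   # equal superposition of |0> and |1>
back = apply(HADAMARD, superposed)   # the two paths to |1> cancel out

print(probabilities(superposed))     # roughly half and half
print(probabilities(back))           # all probability back on |0>
```

After the first gate, both outcomes are equally likely. After the second, the two contributions to the |1> amplitude have opposite signs and cancel, while the contributions to |0> add up. Quantum algorithms arrange this kind of cancellation so that wrong answers interfere destructively and useful ones reinforce.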

Q5. What are the challenges in designing practical quantum programs?

Chong: There are only a small number of quantum algorithms that have an advantage over classical computation.  A practical quantum program must use these algorithms as kernels to solve a practical problem.  Moreover, quantum computers can only take a small number of bits as input and return a small number of bits as output.  This input-output limitation constrains the kinds of problems we can solve.  Having said that, there are still some large classes of problems that may be solved by quantum programs, such as problems in optimization, learning, chemistry and physical simulation.

Q6. What are the main challenges in building a scalable software systems stack for quantum computing?

Ding: A quantum computer implements a fundamentally different model of computation than a modern classical computer does. It would be surprising if the exact design of a classical computer architecture extended well to a quantum computer. The architecture of a quantum computer resembles a classical computer in the 1950s, where device constraints are so high that the full-stack sharing of information is required from algorithms to devices. Some example constraints include hardware noise, qubit connectivity/communication, no copying of data, and probabilistic outcomes. In the near-term, due to these constraints, it is challenging for a quantum computer to follow the modularity and layering models, as seen in classical architectures.

Q7. What kind of hardware is required to run quantum programs?

Chong: There are actually several technologies that may run practical quantum programs in the future. The leading ones right now are superconducting devices and trapped ions. Optical devices and neutral atoms are also promising in the future. Majorana devices are a further future alternative that, if realized, could have substantially better reliability than our current options.

Q8. Quantum computers are notoriously susceptible to making errors (*). Is it possible to mitigate this?

Ding: Research into the error mitigation and correction of quantum computation has produced several promising approaches and motivated inter-disciplinary collaboration – on the device side, researchers have learned techniques such as dynamical decoupling and its generalizations, introducing additional pulses into the systems that filter noise from a signal; on the systems software side, a compiler tool-flow that is aware of device noise and application structure can significantly improve the program success rate; on the application side, algorithms tailored for a near-term device can circumvent noise by a classical-quantum hybrid approach of information processing. In the long term, when qubits are more abundant, the theory of quantum error correction allows computation to proceed fault tolerantly by encoding quantum state such that errors can be detected and corrected, similar to the design of classical error-correcting codes.
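The comparison with classical error-correcting codes can be illustrated with the simplest classical code, the three-bit repetition code. The sketch below is purely classical and all names in it are made up for this example. Quantum codes have to work differently, because quantum states cannot simply be copied, but the underlying idea of spending redundancy to detect and correct errors is the same.

```python
import random


def encode(bit):
    """Repeat the bit three times."""
    return [bit, bit, bit]


def transmit(bits, flip_probability, rng):
    """Flip each bit independently with the given probability."""
    return [b ^ 1 if rng.random() < flip_probability else b for b in bits]


def decode(bits):
    """Majority vote: corrects any single flipped bit."""
    return 1 if sum(bits) >= 2 else 0


def error_rate(flip_probability, trials, protected, seed=0):
    """Fraction of trials in which the received bit differs from the sent bit."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        sent = rng.randint(0, 1)
        if protected:
            received = decode(transmit(encode(sent), flip_probability, rng))
        else:
            received = transmit([sent], flip_probability, rng)[0]
        errors += received != sent
    return errors / trials


raw = error_rate(0.1, 20000, protected=False)
coded = error_rate(0.1, 20000, protected=True)
print(f"unprotected: {raw:.3f}, repetition code: {coded:.3f}")
```

With a flip probability p = 0.1, the unprotected error rate is about p, while the repetition code fails only when two or three bits flip, which in theory happens with probability 3p²(1−p) + p³ = 0.028. The price is three times as many bits, which mirrors why error correction becomes practical only "when qubits are more abundant".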

Q9. Some experts say that we’ll never need quantum computing for everyday applications. What is your take on this?

Chong:  It is true that quantum computing will likely take the form of hardware accelerators for specialized applications.  Yet some of these applications include general optimization problems that may be helpful for many everyday applications.  Finance, logistics, and smart grid are examples of applications that may touch many users every day.

Q10. There is much speculation regarding the cybersecurity threats of quantum computing (**). What is your take on this?

Chong: In the long term, quantum machines may pose a serious challenge to the basis of modern cryptography. Digital commerce relies upon public-key cryptography systems that use a “pseudo-one-way function.” For example, it should be easy to digitally sign a document, but it should be hard to reverse-engineer the secret credentials of the person signing from the signature. The current most practical implementation of a one-way function is to multiply two large prime numbers together. It is hard to find the two primes from their product. All known classical algorithms take time exponential in the number of bits in the product (RSA key). This is the basis of RSA cryptography. Large quantum computers (running Shor’s algorithm) could find these primes in n^3 time for an n-bit key.
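The link between factoring and the secret credentials can be seen in a toy example. The sketch below uses textbook RSA with deliberately tiny primes and no padding, and it uses trial division as a stand-in for an efficient factoring method; it is not Shor's algorithm and not a secure implementation. The function names are invented for this illustration.

```python
def make_keypair(p, q, e):
    """Build a toy RSA key from two primes and a public exponent."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # modular inverse; needs Python 3.8+
    return (n, e), d


def sign(message, d, n):
    """Sign an integer message with the private exponent."""
    return pow(message, d, n)


def verify(message, signature, public_key):
    """Check a signature using only the public key."""
    n, e = public_key
    return pow(signature, e, n) == message


def factor_semiprime(n):
    """Trial division: fine for toy numbers, hopeless for real key sizes."""
    candidate = 2
    while candidate * candidate <= n:
        if n % candidate == 0:
            return candidate, n // candidate
        candidate += 1
    raise ValueError("n has no nontrivial factors")


def recover_private_exponent(public_key):
    """What an attacker can do once the modulus has been factored."""
    n, e = public_key
    p, q = factor_semiprime(n)
    return pow(e, -1, (p - 1) * (q - 1))


public_key, private_d = make_keypair(61, 53, 17)
signature = sign(65, private_d, public_key[0])
print(verify(65, signature, public_key))
print(recover_private_exponent(public_key) == private_d)
```

Both lines print `True`: the signature verifies, and anyone who can factor the public modulus can recompute the private exponent. The security of the scheme rests entirely on the factoring step being infeasible at real key sizes, which is the assumption a large quantum computer would break.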

This quantum capability would force us to re-invent our cryptosystems to use other means of encryption that are both secure and inexpensive. An entire field of “post-quantum cryptography” has grown to address this problem. Secure solutions exist that resist quantum attacks, but finding any that are as simple as the product of two primes has been challenging. On the positive side, quantum computers and quantum networks may also help with security, providing new means of encrypting and securely communicating data.

Q11. What is the vision ahead for quantum computing?

Chong: Practical quantum computation may be achievable in the next few years, but applications will need to be error tolerant and make the best use of a relatively small number of quantum bits and operations. Compilation tools will play a critical role in achieving these goals, but they will have to break traditional abstractions and be customized for machine and device characteristics in a manner never before seen in classical computing.

Q12. How does this book differ from other Quantum Computing Books?

Ding: This is the first book to highlight research challenges and opportunities in near-term quantum computer architectures. In doing so, we develop the new discipline of “quantum computing systems”, which adapts classical techniques to address the practical issues at the hardware/software interface of quantum systems. As such, this book can be used as an introductory guide to quantum computing for computer scientists and engineers, spanning a range of topics across the systems stack, including quantum programming, compiler optimizations, noise mitigation, and simulation of quantum computation.

 ———————————

Professor Fred Chong, April 3, 2019 at Crerar. (Photo by Jean Lachat)

Fred Chong is the Seymour Goodman Professor in the Department of Computer Science at the University of Chicago. He is also Lead Principal Investigator for the EPiQC Project (Enabling Practical-scale Quantum Computing), an NSF Expedition in Computing. Chong received his Ph.D. from MIT in 1996 and was a faculty member and Chancellor’s fellow at UC Davis from 1997-2005. He was also a Professor of Computer Science, Director of Computer Engineering, and Director of the Greenscale Center for Energy-Efficient Computing at UCSB from 2005-2015. He is a recipient of the NSF CAREER award, the Intel Outstanding Researcher Award, and 9 best paper awards. His research interests include emerging technologies for computing, quantum computing, multicore and embedded architectures, computer security, and sustainable computing. Prof. Chong has been funded by NSF, DOE, Intel, Google, AFOSR, IARPA, DARPA, Mitsubishi, Altera and Xilinx. He has led or co-led over $40M in awarded research, and been co-PI on an additional $41M.  

Yongshan Ding is a PhD candidate in the Department of Computer Science at the University of Chicago, advised by Fred Chong. Before UChicago, he received his dual B.Sc. degrees in Computer Science and Physics from Carnegie Mellon University. His research interests are in the areas of computer architecture and algorithms, particularly in the context of quantum computing. His work spans broadly in the theory and application of quantum error correction, efficient and reliable quantum memory management, and optimizations at the hardware/software interface. 

Resources

– Quantum Computer Systems. Research for Noisy Intermediate-Scale Quantum Computers. Yongshan Ding, University of Chicago, Frederic T. Chong, University of Chicago, Morgan & Claypool Publishers, ISBN: 9781681738666 |  2020 | 227 Pages, Link to Web Site


Related Posts
– On using AI and Data Analytics in Pharmaceutical Research. Interview with Bryn Roberts. ODBMS Industry Watch. September 10, 2018

Quote from Industry:  

“What I’m particularly excited about just now is the potential of universal quantum computing (QC). Progress made over the last couple of years gives us more confidence that a fault-tolerant universal quantum computer could become a reality, at a useful scale, in the coming years. We’ve begun to invest time, and explore collaborations, in this field. Initially, we want to understand where and how we could apply QC to yield meaningful value in our space. Quantum mechanics and molecular dynamics simulation are obvious targets; however, there are other potential applications in areas such as Machine Learning. I guess the big impacts for us will follow “quantum inimitability” (to borrow a term from Simon Benjamin from Oxford) in our use-cases, possibly in the 5-15 year timeframe, so this is a rather longer-term endeavour.” — Dr. Bryn Roberts, Global Head of Operations for Roche Pharmaceutical Research & Early Development.

Follow us on Twitter @odbmsorg
Jul 23 20

Thirty Years C++. Interview with Bjarne Stroustrup

by Roberto V. Zicari

“If you keep your good ideas to yourself, they are useless; you could just as well have been doing crossword puzzles. Only by articulating your ideas and making them accessible through writing and talks do they become a contribution.” –Bjarne Stroustrup

Back in 2007 I had the pleasure to interview Bjarne Stroustrup, the inventor of the C++ programming language. Thirteen years later, I still have the pleasure to publish an interview with Bjarne.

RVZ

Q1. You learned the fundamentals of object-oriented programming from Kristen Nygaard, co-inventor of the Simula object-oriented programming language (together with Ole-Johan Dahl back in the 1960s), who often visited your university in Denmark. How did Kristen Nygaard influence your career path?

Bjarne Stroustrup: Kristen was a very interesting and impressive character. He was, of course, highly creative, and also a giant in every way. For starters he was about 6’6” and apparently quite as wide. When so inspired, he could deliver crushing bear hugs. Having a discussion with him on any topic – say programming, crime fiction, or labor policies – was invariably interesting, sometimes inspiring.

As a young Masters student, I met him often because my student office was at the bottom of the stairs leading to the guest apartment. Each month he’d come down from Oslo for a week or so. Upon arrival, he’d call to me (paraphrasing) “round up the usual suspects” and my job was then to deliver half-a-dozen good students and a crate of beer. We then talked – meaning that Kristen poured out information on a variety of topics – for a couple of hours. I learned a lot about design from that, and the basics of object-oriented programming; the Scandinavian school of OOP, of course, where design and modeling the real world in code play major roles.

Q2. In 1979, you received a PhD in computer science from the University of Cambridge, under the supervision of  David Wheeler. What did you learn from David Wheeler that was useful for your future work?

Bjarne Stroustrup: David Wheeler was very much in a class of his own. His problem-solving skills and design abilities were legendary. His teaching style was interesting. Each week, I came to his office to tell him what great ideas I had had or encountered during the week. His response was predictable along the lines of “yes, Bjarne, that’s not a bad idea; in fact, we almost used that for the EDSAC-2.” That is, he’d had that idea about the time I entered primary school and rejected it in favor of something better. I hear that some students found that style of response hard to deal with, but I was fascinated because David then proceeded to clarify my original ideas, evaluate them in context and elaborate on their strengths, weaknesses, possible improvements, and alternatives. With a few follow-up questions from me, he’d continue discussing problems, solutions, and tradeoffs for an hour or more. He taught me a lot about how to explore design spaces and also how to explain ideas – always based on concrete examples. I found his formal lectures terminally boring, and I don’t think he liked giving them; his strengths were elsewhere.

On my first day in Cambridge, he asked me “what’s the difference between a Masters and a PhD?” I didn’t know. “If I have to tell you what to do, it’s a Masters” he said and proceeded to – ever so politely – indicate that a Cambridge Masters was a fate worse than death. I didn’t mind because – as he had probably forgotten – I had just completed a perfectly good Masters in Mathematics with Computer Science from the University of Aarhus. In the years he supervised me, I don’t think he gave me more than a single piece of direct advice. On my last day before leaving Cambridge after completing my thesis, he took me out to lunch and said “you are going to Bell Labs; that’s a very good place with many excellent people, but it’s also a bit of a black hole: good people go in and are never heard from again, whatever you do, keep a high external profile.” That fitted perfectly with my view that if you keep your good ideas to yourself, they are useless; you could just as well have been doing crossword puzzles. Only by articulating your ideas and making them accessible through writing and talks do they become a contribution.

David had a great track record with both hardware and software. That appealed to me.

Both David Wheeler and Kristen Nygaard were honest, kind, and generous people – people you could trust and who worked hard for what they believed to be important.

Q3. You are quoted as saying that you designed C++ back in 1979 to answer the question “How do you directly manipulate hardware and also support efficient high-level abstraction?” Do you still believe this was a good idea?

Bjarne Stroustrup:  Definitely! Dennis Ritchie famously distinguished between languages designed “to solve a problem” and languages designed “to prove a point.” Like C, C++ is of the former category. The borderline between software and hardware is interesting, challenging, constantly changing, and constantly increasing in importance. The fundamental idea of C++ was to provide support for direct access to hardware, based on C’s model and then to allow people to “escape” to higher levels of expression through (what became known as) zero-overhead abstraction. There seems to be a never-ending need for code in that design space. I started with C plus Simula-like classes and over the years improvements (such as templates) have greatly increased C++’s expressive power and optimizability.

Q4. Why did you choose C as a base for your work? 

Bjarne Stroustrup: I decided to base my new tool/language on something, rather than start from scratch. I wanted to be part of a technical community and not re-hash all the fundamental design decisions. I knew at least a dozen languages that gave flexibility and good access to hardware facilities that I could have built upon. For example, I was acquainted with Algol 68 and liked its type system, but it didn’t have much of an industrial community.
C’s support for static type checking was weak, but the local support and community was superb: Dennis Ritchie and Brian Kernighan were just across the corridor from me! Also, its way of dealing with hardware was excellent, so I decided to base my work on C and add to it as needed, starting with function argument checking and classes with constructors and destructors.

Q5. You also wrote that one way of looking at C++ is as the result of decades of three contradictory demands: Make the language simpler! Add these two essential features now!! Don’t break (any of) my code!!! Can you please explain what you mean by these demands?

Bjarne Stroustrup:  Many people have very reasonable wishes for improvement, but often those wishes are contradictory and any good design must involve tradeoffs.

  • Clearly C++ has undesirable complexity and “warts” that we’d like to remove. I am on record saying that it would be possible to build a language 1/10th of the size of C++ (by any measure) without reducing its expressive power or run-time power (HOPL3). That would not be easy, and I don’t think current attempts are likely to succeed, but I consider it possible and desirable.
  • Unfortunately, achieving this reasonable aim would break on the order of half a trillion lines of code, outdate huge amounts of teaching material, and outdate many programmers’ hard-earned experience.
    Many, possibly even most, organizations would still find themselves dependent on C++ for many more years, possibly decades. Automatic and guaranteed correct source-to-source translation could ease the pain, but it is hard to translate from messy code to cleaner code and much crucial C++ code manipulates tricky aspects of hardware.
  • Finally, few people want just simplification. They also want new facilities that will allow them to cleanly express something that is very hard to express in C++. They want novel features that necessarily make the language bigger.
  • We simply cannot have all we want. That should not make us despondent or paralyzed, though. Progress is possible, but it involves painful compromises and careful design. It is worth remembering that every long-lived and widely used language will contain features that in retrospect could be seriously improved or replaced with better alternatives. It will also have a large code base that doesn’t live up to modern standards of design and implementation. This is an unavoidable price of success.

Q6. C++ 1979-2020. What are the main lessons you have learned in all these years?

Bjarne Stroustrup:  There are many lessons, so it is hard to pick a main one. I assume you mean programming language design lessons.

Fundamental decisions are important and hard to change. Once in real-world use, basic language decisions cannot be changed. Fashions are seductive and hard to resist, but change over timespans shorter than the lifetime of a language. It is important to be a bit humble and suspicious about one’s own certainties. Often, the first reasonable solution you find isn’t the best in the longer run. Stability over decades is a feature. You don’t know what people are going to use the language for, or how. No one language and no one programming style will serve all users well.

Complete type safety and complete general resource management have been ideals for C++ from the very beginning (1979). However, given the need for generality and uncompromising performance, these were ideals that could be approached only incrementally as our understanding and technology improved. Arbitrary C++ code cannot be guaranteed type- and resource-safe and we cannot modify the language to offer such guarantees without breaking billions of lines of code. However, we have now reached the point where we can guarantee complete type safety and resource safety by using a combination of guidelines, library support, and static analysis: The C++ Core Guidelines. I outlined the principles of the Core Guidelines in a 2015 paper. Currently a static analyzer supporting the Core Guidelines ships with Microsoft Visual Studio. I hope to see support for the Guidelines that is not part of a single implementation so that their use could become universal.

My appreciation of tool support has grown over the years. We don’t write programs in just a programming language, but in a specific tool chain and specific environment made up of libraries and conventions.
The C++ world offers a bewildering variety of tools and libraries. Many are superb, but there are no dominant “unofficial standards”, so it is very hard to choose and to collaborate with people who made different choices. I hope for some convergence that would significantly help C++ developers and C++ teaching. My HOPL-4 paper, Thriving in a crowded and changing world: C++ 2006–2020, has a discussion of that.

Q7. Who is still using C++?

Bjarne Stroustrup:  More developers than ever. C++ is the basis of many, many systems and applications, including some of our most widely used and best-known systems. My HOPL-4 paper, Thriving in a crowded and changing world: C++ 2006–2020, has a discussion of that. Major users include Google, Facebook, the semiconductor industry, gaming, finance, automotive and aerospace, medicine, biology, high-energy physics, and astronomy. Much is, however, invisible to end users.

Developers are hard to count, but surveys say about 4.5 million C++ users, and increasing. I have even heard “5 millions.” We don’t really have good ways of counting. Many measures, such as Tiobe, count “noise”; that is, mentions on the Web, but one enthusiastic student posts much more than 200 busy developers of important applications.

Q8. In this time of Artificial Intelligence, is C++ still relevant? 

Bjarne Stroustrup:  Certainly! C++ is the basis of most current AI/ML. Most new automobile software is C++, as is much high-performance software. Whatever language you use for AI/ML, the implementation usually critically involves some C++ library, such as TensorFlow. A serious data scientist expressed it like this: I spend 97% of my time writing Python and my computer uses 98.5% of its cycles running C++ to execute that.

Q9. What are in your opinion the most interesting programming languages currently available? 

Bjarne Stroustrup: Maybe C++. Many of the ideas that are driving modern language development come from C++ or have been brought into the mainstream through C++: RAII for general resource management. Templates for generic programming. Templates and constexpr functions for compile-time evaluation. Various concurrency mechanisms. In turn, C++ of course owes much to earlier languages and research. For future developments that will affect programming techniques, I’d keep an eye on static reflection.
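For readers who have not met these mechanisms, the following is a minimal sketch of three of them: a constexpr function evaluated at compile time, a template for generic programming, and an RAII guard. The names `factorial`, `sum`, and `DepthGuard` are hypothetical and chosen only for this example.

```cpp
#include <vector>

// constexpr: the compiler can evaluate this when the argument is a constant.
constexpr unsigned long long factorial(unsigned n) {
    return n <= 1 ? 1ULL : n * factorial(n - 1);
}
static_assert(factorial(5) == 120, "computed during compilation");

// Template: one definition works for any element type that supports operator+.
template <typename T>
T sum(const std::vector<T>& values, T init = T{}) {
    for (const T& v : values) init = init + v;
    return init;
}

// RAII: the constructor acquires and the destructor releases, so the
// release runs on every exit path from the scope, including exceptions.
class DepthGuard {
public:
    explicit DepthGuard(int& depth) : depth_(depth) { ++depth_; }
    ~DepthGuard() { --depth_; }
    DepthGuard(const DepthGuard&) = delete;
    DepthGuard& operator=(const DepthGuard&) = delete;

private:
    int& depth_;
};
```

The same constructor/destructor pairing is what standard types such as smart pointers, locks, and file streams rely on.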

Much interesting work is going on in functional languages and in “Scripting” (e.g., TypeScript).

Q10. Why did you decide to leave a full-time job in academia and join Morgan Stanley?

Bjarne Stroustrup:  There were a few related reasons.

Over a decade, I had done most of the things a career academic does: teaching undergraduates, teaching graduate students, graduating PhDs, planning curricula, writing textbooks (e.g., Programming: Principles and Practice Using C++ (Second Edition)), writing conference and journal research papers (e.g., Specifying C++ Concepts), applying for and receiving research grants, and sitting on university committees. It was no longer new, interesting, and challenging.

I felt that I needed to get back “to the coal face”, to industry, to make sure that my work and opinions were still relevant. My interests in scale, reliability, performance, and maintainability were hard to pursue in academia.

I felt the need to get closer to my family in New York City and in Europe.

Morgan Stanley was in New York City, had very interesting problems related to reliability and performance of distributed systems, large C++ code bases, and – a bit of a surprise to me given the reputation of the finance industry – many nice people to work with.

Q11. You are also a Visiting Professor in Computer Science at Columbia University. What is the key message you wish to give to young students?

Bjarne Stroustrup:  Our civilization depends critically on software. We must improve our systems and to do that we need to become more professional. That’s the same message I’d try to send to experienced developers, managers, and executives.

Also I talk about the design principles of C++ and show concrete examples of how they were put into practice over the decades. You cannot teach design in the abstract.

Qx. Anything else you wish to add?

Bjarne Stroustrup:  Education is important, but not everyone who wants to write software needs the same education. We should make sure that there is a well-supported path through the educational maze for people who will write our critical systems, the ones we rely upon for our lives and livelihoods. We need to strive for a degree of professionalism equal to what we see in the best medical doctors and engineers.
I wrote a couple of papers about that: What should we teach software developers? Why? and Software Development for Infrastructure.

——————————-


Bjarne Stroustrup is the designer and original implementer of C++ as well as the author of The C++ Programming Language (4th Edition), A Tour of C++ (2nd Edition), Programming: Principles and Practice using C++ (2nd Edition), and many popular and academic publications.
Dr. Stroustrup is a Technical Fellow and Managing Director in the technology division of Morgan Stanley in New York City as well as a visiting professor at Columbia University. He is a member of the US National Academy of Engineering, and an IEEE, ACM, and CHM fellow. He is the recipient of the 2018 NAE Charles Stark Draper Prize for Engineering and the 2017 IET Faraday Medal. He did much of his most important work in Bell Labs.
His research interests include distributed systems, design, programming techniques, software development tools, and programming languages. To make C++ a stable and up-to-date base for real-world software development, he has been a leading figure with the ISO C++ standards effort for 30 years. He holds a master’s in Mathematics from Aarhus University and a PhD in Computer Science from Cambridge University, where he is an honorary fellow of Churchill College.

www.stroustrup.com

Related Posts

– 10+1 Questions on Innovation to Bjarne Stroustrup. ODBMS Industry Watch, November 13, 2007

– 2 More Questions to Bjarne Stroustrup: Locations, People and Innovation. ODBMS Industry Watch, December 10, 2007

Follow ODBMS.org on Twitter: @odbmsorg

##

Jun 8 20

Fighting Covid-19 with Graphs. Interview with Alexander Jarasch

by Roberto V. Zicari

“There are an enormous number of applications that we can provide. Just to mention a few of them: Scanning literature and patents for genes, proteins, targets and drugs. Finding information in clinical trials, which drugs are used and what inclusion/exclusion criteria exist. Automatically finding and querying genes with their synonyms and connected gene function and in which tissues they are expressed.” –Alexander Jarasch

I have interviewed Dr. Alexander Jarasch, head of the Data and Knowledge Management department at the German Center for Diabetes Research (DZD). Alexander is a team member of The CovidGraph Project.
CovidGraph is a non-profit collaboration of researchers, software developers, data scientists and medical professionals.
The aim of the project is to help researchers quickly and efficiently find their way through COVID-19 datasets and to provide tools that use artificial intelligence, advanced visualization techniques, and intuitive user interfaces.

RVZ

Q1. What is CovidGraph?

Alexander Jarasch: CovidGraph is a knowledge graph that connects text data from scientific literature and intellectual property with clinical trials, drugs and entities from biomedical research such as genes, proteins, their function and regulation.

Q2. What is the aim of this project?

Alexander Jarasch: To provide researchers with Covid-19 relevant information that is connected and easy to query. Usually, this requires manual, tedious and error-prone work that can be sped up by using CovidGraph.

Q3. Who is this project aimed at?

Alexander Jarasch: Researchers and medical doctors, regardless of their field of research. Since Covid-19 has many complications, it is a valuable resource for getting connected data.

Q4. What data sets do you use?

Alexander Jarasch: Public datasets from literature, patents, case numbers, clinical trials, therapeutic targets, genes, transcripts, proteins, gene ontology, gene expression data, pathways. There are more to come. (See list at the end of the interview)

Q5. How do you check the quality and reliability of the COVID-19 datasets you use in your project?

Alexander Jarasch: The datasources are well-established databases that have been used and cited for years.

Q6. Why use knowledge graphs?

Alexander Jarasch: Research data, especially in healthcare, is highly connected, very heterogeneous and often unstructured. Today, these datasources are siloed and connections between them aren’t available. Connecting the datasources enables a more comprehensive view of the data. By connecting the data, insights emerge that were hidden before.
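To make the idea of connecting siloed sources concrete, here is a toy sketch of a labeled graph with a multi-hop lookup. It is not the CovidGraph schema, data, or tooling. `TinyGraph`, `Edge`, `follow`, and the node and relation names used with them are hypothetical, invented for the example.

```cpp
#include <string>
#include <utility>
#include <vector>

// One labeled, directed edge, e.g. ("paper:1", "MENTIONS", "gene:ACE2").
// All node identifiers in this sketch are made up for illustration.
struct Edge {
    std::string from;
    std::string relation;
    std::string to;
};

class TinyGraph {
public:
    void add(std::string from, std::string relation, std::string to) {
        edges_.push_back({std::move(from), std::move(relation), std::move(to)});
    }

    // Follow the given relations in order, starting from one node, and
    // return the nodes reached after the last hop.
    std::vector<std::string> follow(const std::string& start,
                                    const std::vector<std::string>& relations) const {
        std::vector<std::string> frontier{start};
        for (const std::string& rel : relations) {
            std::vector<std::string> next;
            for (const std::string& node : frontier) {
                for (const Edge& e : edges_) {
                    if (e.from == node && e.relation == rel) next.push_back(e.to);
                }
            }
            frontier = std::move(next);
        }
        return frontier;
    }

private:
    std::vector<Edge> edges_;
};
```

Once a literature source contributes "paper mentions gene" edges and a protein source contributes "gene codes protein" edges, a single two-hop `follow` call answers a question that neither source could answer alone. A graph database expresses the same traversal declaratively and at far larger scale.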

Q7. Which tools are you developing to explore papers, patents, existing treatments and medications around the family of the corona viruses?

Alexander Jarasch: On the one hand we provide user interfaces for interactive data browsing and querying. For example, users can use Linkurious, Graphiken, derive GmbH and Neo4j Bloom. On the other hand we develop a more specific UI for users from biomedical research together with yWorks.

Q8. What are the applications that The CovidGraph project provides?

Alexander Jarasch: There are an enormous number of applications that we can provide. Just to mention a few of them: Scanning literature and patents for genes, proteins, targets and drugs. Finding information in clinical trials, which drugs are used and what inclusion/exclusion criteria exist. Automatically finding and querying genes with their synonyms and connected gene function and in which tissues they are expressed.

Q9. Who is maintaining the data stored in the Knowledge Graph? Is it centralized or distributed?

Alexander Jarasch: It’s maintained by a community of volunteers and data is stored on a publicly accessible server.

Q10. The COVID*Graph should provide the data basis for understanding the processes involved in a coronavirus infection. What did you learn so far?

Alexander Jarasch: In parallel to data integration and preliminary data analysis, we found that Covid-19 is thought to affect more than just lung cells. Researchers also support this finding, which is reported in several articles. We found out that ACE2 (Angiotensin-converting enzyme 2) is the gene that is mentioned most in scientific articles. This seems obvious since this is the receptor the coronavirus uses to access the cells.

Qx. Anything else you wish to add?

Alexander Jarasch: We are a private-public partnership, with volunteers from several companies working with graph technology. We are a non-profit community and hope to support researchers and doctors in finding a cure for Covid-19 / Sars-Cov-2 and related diseases.

——————————–
Dr. Alexander Jarasch is the head of the Data and Knowledge Management department at the German Center for Diabetes Research (DZD). His team supports scientists from basic research and clinical research with IT solutions from data management to data analysis. New insights from diabetes research and its complications are stored in a knowledge graph connecting data from basic research, animal models and clinical trials.
Dr. Jarasch received his PhD in structural bioinformatics and biochemistry from Ludwig-Maximilians University (LMU) in Munich and has a master’s degree in bioinformatics from the LMU and the Technical University of Munich.
He completed his postdoctoral training on behalf of Evonik Industries AG and Roche Diagnostics GmbH.

Resources

The CovidGraph Project

Covidgraph.org on GitHub

bioRxiv (pronounced “bio-archive”) is a free online archive and distribution service for unpublished preprints in the life sciences. It is operated by Cold Spring Harbor Laboratory, a not-for-profit research and educational institution. By posting preprints on bioRxiv, authors are able to make their findings immediately available to the scientific community and receive feedback on draft manuscripts before they are submitted to journals.

medRxiv (pronounced “med-archive”) is a free online archive and distribution server for complete but unpublished manuscripts (preprints) in the medical, clinical, and related health sciences.

The Lens is building an open platform for Innovation Cartography. Specifically, the Lens serves nearly all of the patent documents in the world as open, annotatable digital public goods that are integrated with scholarly and technical literature along with regulatory and business data. The Lens will allow document collections, aggregations, and analyses to be shared, annotated, and embedded to forge open mapping of the world of knowledge-directed innovation.

Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.

Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide.

New UniProt portal for the latest SARS-CoV-2 coronavirus protein entries and receptors, updated independent of the general UniProt release cycle.

RefSeq: NCBI Reference Sequence Database. A comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein.

The Gene Ontology resource. The mission of the GO Consortium is to develop a comprehensive, computational model of biological systems, ranging from the molecular to the organism level, across the multiplicity of species in the tree of life.

The GTEx Portal. The Genotype-Tissue Expression (GTEx) project is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation.

REACTOME is an open-source, open access, manually curated and peer-reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic and clinical research, genome analysis, modeling, systems biology and education.

ClinicalTrials.gov is a resource provided by the U.S. National Library of Medicine.

COVID-19 Response United Nations

COVID-19 Resources Johns Hopkins University.

COVID-19 datasets

COVID-19 Open Research Dataset (CORD-19)

In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.
https://pages.semanticscholar.org/coronavirus-research

The Lens COVID-19 Datasets

The Lens has assembled free and open datasets of patent documents, scholarly research works metadata and biological sequences from patents, and deposited them in a machine-readable and explorable form.
https://about.lens.org/covid-19/

Ensembl Genome Browser

Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species. http://www.ensembl.org

NCBI Gene Database

Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide. https://www.ncbi.nlm.nih.gov/gene

The Gene Ontology Resource

The Gene Ontology (GO) knowledgebase is the world’s largest source of information on the functions of genes. This knowledge is both human-readable and machine-readable, and is a foundation for computational analysis of large-scale molecular biology and genetics experiments in biomedical research. http://geneontology.org

2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE

This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL). https://github.com/CSSEGISandData/COVID-19

United Nations World Population Prospects 2019

The 2019 Revision of World Population Prospects is the twenty-sixth round of official United Nations population estimates and projections that have been prepared by the Population Division of the Department of Economic and Social Affairs of the United Nations Secretariat. https://population.un.org/wpp/


##

Apr 28 20

On Vertica 10.0. Interview with Mark Lyons

by Roberto V. Zicari

“Supporting arrays, maps and structs allows customers to simplify data pipelines, unify more of their semi-structured data with their data warehouse as well as maintain better real world representation of their data from relationships between entities to customer orders with item level detail. A good example is groups of cell phone towers that are used for one call while driving on the highway.” –Mark Lyons

I have interviewed Mark Lyons, Director of Product Management at Vertica. We talked about the new Vertica 10.0.

RVZ

Q1. What is your role at Vertica?

Mark Lyons: My role at Vertica is Director of Product Management. I have a team of 5 product managers covering analytics, security, storage integrations and cloud.

Q2. You recently announced Vertica Version 10. What is special about this release?

Mark Lyons: Vertica 10.0 is a milestone release and special in many ways from Eon Mode improvements to TensorFlow integration for trained models, and the ability to query complex data types like arrays, maps and structs. This release delivers on all aspects of why customers choose Vertica including performance improvements and our constant dedication to keeping our platform in front of the competition from an architecture standpoint.

Q3. Specifically why and how did you improve Vertica in Eon Mode?

Mark Lyons: The Eon Mode improvements are a long list of features including faster elasticity with sub-clusters, stronger workload isolation between sub-clusters and more control over the depot for performance tuning. Also, Eon Mode is now available on two new communal storage options, Google Cloud Platform and Hadoop Distributed File System (HDFS), in addition to what we already support, which is Amazon Web Services S3, Pure Storage Flash Blades and MinIO. From here we are working on adding Azure and Alibaba for public cloud options and expanding our on-prem options with other vendors that our customers have shown interest in, like the Dell/EMC ECS storage offering and others.

Eon mode is run worldwide by many of our largest customers in production at this point and if you are interested in learning more about the scale and flexibility I recommend looking at our case study with The Trade Desk. They now run 2 Eon Mode clusters both at 320 nodes and petabytes of data growing every day.

Q4. So, one of your key improvements is to support complex types to improve integration with Parquet. How will that benefit businesses?

Mark Lyons: We’ve had high performance query ability on Parquet data whether that is on HDFS or S3 for years including column pruning and predicate pushdown. We continue to invest in our Parquet integration since it is an important part of many organizations’ analytics & data lake strategy. Over the past couple of releases we’ve been building the ability to query complex data types while maintaining columnar execution and late materialization for high performance.
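For readers unfamiliar with the two optimizations named above, here is a generic sketch of the idea. It does not describe Vertica's implementation or the Parquet API. `RowGroup`, `ScanResult`, and `sum_amount_above` are hypothetical names, and the layout is only loosely modeled on a columnar file that keeps min/max statistics per block of rows.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical columnar block: each column is stored separately, together
// with min/max statistics for the column the query filters on.
struct RowGroup {
    int min_amount;
    int max_amount;
    std::vector<int> amount;           // column used by the query
    std::vector<std::string> comment;  // column the query never touches
};

struct ScanResult {
    long long total = 0;
    std::size_t groups_read = 0;
};

// Evaluates: SELECT SUM(amount) WHERE amount > threshold
ScanResult sum_amount_above(const std::vector<RowGroup>& groups, int threshold) {
    ScanResult result;
    for (const RowGroup& g : groups) {
        // Predicate pushdown: the statistics prove that no row in this
        // group can match, so the whole group is skipped without a scan.
        if (g.max_amount <= threshold) continue;
        ++result.groups_read;
        // Column pruning: only `amount` is read; `comment` is never accessed.
        for (int a : g.amount) {
            if (a > threshold) result.total += a;
        }
    }
    return result;
}
```

With data on remote storage such as HDFS or S3, a skipped group or column is data that never has to be fetched, which is where most of the benefit comes from.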

Supporting arrays, maps and structs allows customers to simplify data pipelines, unify more of their semi-structured data with their data warehouse as well as maintain better real world representation of their data from relationships between entities to customer orders with item level detail. A good example is groups of cell phone towers that are used for one call while driving on the highway. We have seen tremendous interest from our customers in this new functionality. We have been actively testing preview builds of Vertica 10 with many customers for querying maps, arrays and structs for some time now.

Q5. How will this new release benefit people using Vertica in Enterprise Mode, people who aren’t even on the Cloud and have no plans to go to the Cloud?

Mark Lyons: Vertica Enterprise Mode benefits from all of the improvements to the query optimizer, execution engine, machine learning, complex data types and beyond since there is only one Vertica code base and the Eon mode differences are limited to communal storage and sub-clusters. Enterprise Mode is the traditional direct attached storage. Massively Parallel Processing (MPP), shared nothing architectures are still appropriate for many organizations that don’t have plans to move to public cloud and have traditional data center infrastructure. Vertica doesn’t restrict on-premises customers to only using shared storage options like Pure Storage Flash Blades or HDFS. With a Vertica license these customers have the flexibility to deploy where they want in whichever architecture fits today, and they can change in the future without any license cost.

Q6. What are the features you offer in your in-database machine learning in Vertica?

Mark Lyons: Vertica is not normally thought of as a data science platform, coming out of the MPP Column Store RDBMS space, but we have built functions to make the data science pipeline very easy. We offer functions for loading data in a variety of formats, enrichment, preparation and quality functions for data transformation, as well as algorithms to train, score and evaluate models. There’s a lot more than I can begin to cover here. To learn more I suggest reading about the Vertica machine learning capabilities here.

In Vertica 10.0 we’ve added the capabilities to import and export PMML models to support data scientists training models in other tools like Python or Spark, or use cases where a model trained in Vertica should be pushed to the edge for evaluation in a complex event processing/streaming system. We also added TensorFlow integration to import deep learning models trained on GPUs outside of Vertica into the data warehouse for scoring and evaluation on new customer or device data as it arrives.

Q7. There are several companies offering data platforms for machine learning (e.g. Alteryx Analytics, H2O.ai, RapidMiner, SAS Enterprise Miner (EM), SAS Visual Analytics, Databricks Unified Analytics Platform, IBM SPSS, Microsoft Azure Machine Learning, Teradata Unified Data Architecture, InterSystems IRIS Data Platform). How does Vertica compare to the other data platforms offering machine learning features?

Mark Lyons: The Vertica differentiation compared to these other tools is all about bringing scale, concurrency, performance and operational ML simplicity to the story. All of the data science tools you mention have equivalent functions for data prep, modeling, algorithms etc. and with all of the work we have done the past 3+ years Vertica has much of the same functionality you find in those other tools except for the GUI. For that, many of our customers use Jupyter notebooks.

Vertica performance is achieved first by the MPP architecture and second by in-memory processing with automatic spill to disk, so we can handle the largest data sets without being limited by the compute power of a single node or by the memory available, as many solutions are. The beauty of this is that users do not have to be aware of this or do anything. It just works! In addition to being able to train on trillions of rows and thousands of columns you get enterprise readiness with resource management between jobs, high concurrency to support many users on the same system at the same time, workload isolation to keep ML workloads from overwhelming other analytics areas, security built in with authentication & access policies, and compliance with logging/audit of all ML workflows.


Q8. What are the most successful use cases which use the Machine Learning features offered by Vertica?

Mark Lyons: We are solving all of the most common use cases from fraud detection to churn prevention to cybersecurity threat analytics. Vertica brings a new level of scale and speed to allow for more frequent model re-training, and use of the entire dataset even if that is trillions of rows and PBs of data without data movement or down sampling.

Q9. Anything else you wish to add?

Mark Lyons: A few weeks ago we wrapped up our Vertica Big Data Conference 2020 and I recommend anyone interested in learning more to come and watch replays of the sessions.

———————-


Mark Lyons, Director of Product Management, Vertica
Mark leads the Vertica product management team. His expertise is in new product introductions, go-to-market planning, product roadmaps, requirement development, build/buy/partner analysis, new products, innovation and strategy.

Resources

Virtual Vertica Big Data Conference 2020

Related Posts

Top 10 Highlights from the Virtual Vertica BDC 2020


##