"Trends and Information on AI, Big Data, Data Science, New Data Management Technologies, and Innovation."

This is the Industry Watch blog. To see the complete ODBMS.org
website with useful articles, downloads and industry information, please click here.

Nov 2 21

On Designing and Building Enterprise Knowledge Graphs. Interview with Ora Lassila and Juan Sequeda

by Roberto V. Zicari

“The limits of my language mean the limits of my world.” – Ludwig Wittgenstein

I have interviewed Ora Lassila, Principal Graph Technologist in the Amazon Neptune team at AWS, and Juan Sequeda, Principal Scientist at data.world. We talked about knowledge graphs and their new book.

RVZ

Q1. You wrote a book titled “Designing and Building Enterprise Knowledge Graphs”. What was the main motivation for writing such a book?

Ora Lassila and Juan Sequeda:  We wanted to tackle the topic of knowledge graphs more broadly than just from the technology standpoint. There is more than just technology (e.g., graph databases) when it comes to successfully building a knowledge graph. 

Time and time again we see people thinking about knowledge graphs and jumping to the conclusion that they just need a graph database and start there. Not only is there more technology you need, but there are issues with people, processes, organizations, etc.

Q2. What are knowledge graphs and what are they useful for?

Ora Lassila and Juan Sequeda:  We see knowledge graphs as a vehicle for data integration and to make data accessible within an organization. Note that when we say “accessible data”, we really mean this: accessible data = physical bits + semantics. The semantics part is really important, since no data is truly accessible unless you also understand what the data means and how to interpret it. We call this issue the “knowledge/data gap”; Chapter 1 of our book gets deep into this.

You could say that knowledge graphs are a way to “democratize” data: make data more accessible and understandable to people who are not technology experts.

Q3. Why connect relational databases with knowledge graphs?

Ora Lassila and Juan Sequeda:  Frankly, the majority of enterprise data is in relational databases, so this seemed like a very good way to scope the problem. At the beginning of our book we show examples of how data is connected today and frankly, it’s a pain. And it’s not just a technical pain, there are important social and organizational aspects to this.

Juan Sequeda:  Understanding the relationship between relational databases and the semantic web/knowledge graphs has been my quest since my undergraduate years. The title of my PhD dissertation is “Integrating Relational Databases with the Semantic Web”. Therefore I can say that this is a passion of mine. 

Q4. Does it make more sense to use a native graph database, or a NoSQL database, instead?

Ora Lassila and Juan Sequeda:  There is always the question “why use X instead of Y?”… and the answer almost always is “it depends”. We even bring this up in the foreword: As computer scientists we understand that there are many technologies that can be used to solve any particular problem. Some are easier, more convenient, and others are not. Just because you can write software in assembly language does not mean you shouldn’t seek to use a high-level programming language. Same with databases: find one that suits your purpose best.

Q5. What are the typical roles within an organization responsible for the knowledge graph?

Ora Lassila and Juan Sequeda:  Organizations really need to get into the mindset of treating data as a product. When you acknowledge this, you realize you need the roles for designing, implementing and managing products, in this case data products. We see upcoming roles such as data product managers and knowledge scientists (i.e. Knowledge Engineers 2.0). We get into this in Chapter 4 of our book.

Q6. Data and knowledge are often in silos. Sharing knowledge and data is sometimes hard in an enterprise. What are the technical and non technical reasons for that?

Ora Lassila and Juan Sequeda:  Technical problems are solvable, and many solutions exist. That said, we think knowledge graphs are really addressing this issue nicely.

The non-technical issues are an interesting challenge, and in many ways more difficult: people and process, organizational structure, centralization vs decentralization, etc. One specific issue that shows up all the time is this: If you want to share knowledge within a broader organization, you have to cross organizational boundaries, and that lands you on someone else’s “turf”. There is a great deal of diplomacy that is needed to tackle these kinds of issues. 

Q7. When is it more appropriate to use RDF graph technologies instead of native property graph technologies?

Ora Lassila and Juan Sequeda:  First, we object to the notion of “native” when it comes to property graphs, they are no more native than RDF graphs.

These are two slightly different approaches to building graphs. Ultimately, the question is not all that interesting. A more interesting question is: When should you use a graph as opposed to something else? If you do decide to use a graph, there are a lot of considerations and modeling decisions before you even come to the question of RDF vs. property graphs.

Of course, RDF is better suited to some situations (e.g., when you use external data, or have to merge graphs from different sources). Try using property graphs there and you merely end up re-inventing mechanisms that are already part of RDF. On the other hand, property graphs often appeal more to software developers, thanks to available access mechanisms and programming language support (e.g., Gremlin).
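The point about merging graphs from different sources can be illustrated with a minimal sketch. It assumes a graph is simply a set of (subject, predicate, object) triples that share global identifiers; the `ex:` names and the two sources are hypothetical, and real RDF merging also has to handle blank nodes, which this sketch ignores.

```python
# Two graphs from different (hypothetical) sources, each a set of
# (subject, predicate, object) triples that use shared global identifiers.
crm = {
    ("ex:acme", "ex:name", "Acme Corp"),
    ("ex:acme", "ex:locatedIn", "ex:austin"),
}
billing = {
    ("ex:acme", "ex:name", "Acme Corp"),   # same fact, present in both sources
    ("ex:acme", "ex:owes", "ex:invoice42"),
}

# With shared identifiers, merging is plain set union: duplicates collapse.
merged = crm | billing


def describe(graph, subject):
    """Return the (predicate, object) pairs for one subject, sorted for stable output."""
    return sorted((p, o) for s, p, o in graph if s == subject)


print(len(merged))
print(describe(merged, "ex:acme"))
```

Because both sources name the same entity with the same identifier, no join logic or key reconciliation is needed in this toy case; that is the mechanism a property graph setup would otherwise have to rebuild.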

Q8. How can enterprises successfully adopt knowledge graphs to integrate data and knowledge, without boiling the ocean?

Ora Lassila and Juan Sequeda:  First of all, you can’t build enterprise knowledge graphs in a “boil the ocean” approach. No chance in hell. You first need to break the problem in smaller pieces, by business units and use cases. This ultimately is a people and process problem. The tech is already here.

That said, there is a certain “build it and they will come” aspect to knowledge graphs. You should think of them more as a platform rather than as an application. Start by knowing some use cases, and gradually generalize and widen your scope. But you need to be solving some pressing problems for the business. Spend time understanding the problems, the limitations of their current solutions (assuming they are somewhat viable) and finding a champion (i.e. “if you can solve this problem better/faster/etc, I’m all ears!”). Also try to avoid educating on the technology: Business units don’t care if their problem is solved with technology A, B or C… all they want is for their problem to be solved.

Q9. Knowledge graphs and AI. Is there any relationship between them?

Ora Lassila and Juan Sequeda:  Yes. Knowledge Graphs are a modern solution to a long-time (and in some ways, “ultimate”) goal in computer science: to integrate data and knowledge at scale. For at least the past half century, we’ve seen independent and integrated contributions coming from the AI community (namely knowledge representation, a subfield of classical AI) and the data management community.  See section 1.3 of the book.

Qx Anything else you wish to add?

Ora Lassila and Juan Sequeda:  We see a lot of what Albert Einstein gave as the definition of insanity: Doing the same thing over and over, and expecting different results. We need to do something truly different. But this is challenging for many reasons, not least because of this: 

“The limits of my language mean the limits of my world.” – Ludwig Wittgenstein

For example, if SQL is your language, it may be very hard for you to see that there are some completely different ways of solving problems (case in point: graphs and graph databases).

Another challenge is that there are hard people and process issues, but as technologists we are wired to focus on technology, and to seek how to scale and automate. 

Finally, we think the “graph industry” needs to evolve past the RDF vs. property graphs issue. Most people do not care. We need graphs. Period.

………………………………………..

Dr. Ora Lassila is a Principal Graph Technologist in the Amazon Neptune team at AWS, mostly focusing on knowledge graphs. Earlier, he was a Managing Director at State Street, heading their efforts to adopt ontologies and graph databases. Before that, he worked as a technology architect at Pegasystems, as an architect and technology strategist at Nokia Location & Commerce (aka HERE), and prior to that he was a Research Fellow at the Nokia Research Center Cambridge. He was an elected member of the Advisory Board of the World Wide Web Consortium (W3C) in 1998-2013, and represented Nokia in the W3C Advisory Committee in 1998-2002. In 1996-1997 he was a Visiting Scientist at MIT Laboratory for Computer Science, working with W3C and launching the Resource Description Framework (RDF) standard; he served as a co-editor of the RDF Model and Syntax specification.

Juan Sequeda is Principal Scientist at data.world. He holds a PhD in Computer Science from The University of Texas at Austin. Juan’s goal is to reliably create knowledge from inscrutable data. His research and industry work has been on designing and building Knowledge Graphs for enterprise data integration. Juan has researched and developed technology on semantic data virtualization, graph data modeling, schema mapping and data integration methodologies. He pioneered technology to construct knowledge graphs from relational databases, resulting in W3C standards, research awards, patents, software and his startup Capsenta (acquired by data.world). Juan strives to build bridges between academia and industry as the current co-chair of the LDBC Property Graph Schema Working Group, past member of the LDBC Graph Query Languages task force, standards editor at the World Wide Web Consortium (W3C) and member of organizing committees of scientific conferences, including being the general chair of The Web Conference 2023. Juan is also the co-host of Catalog and Cocktails, an honest, no-bs, non-salesy podcast about enterprise data.

Resources

Designing and Building Enterprise Knowledge Graphs Synthesis Lectures on Data, Semantics, and Knowledge August 2021, 165 pages, (https://doi.org/10.2200/S01105ED1V01Y202105DSK020) Juan Sequeda, data.world; Ora Lassila, Amazon 

Related Posts

Fighting Covid-19 with Graphs. Interview with Alexander Jarasch ODBMS Industry Watch, June 8, 2020

Follow us on Twitter: @odbmsorg

##

Sep 20 21

On Responsible AI. Interview with Kay Firth-Butterfield, World Economic Forum.

by Roberto V. Zicari

“I think that many companies need to understand that their customers are worried about the use of AI and then act accordingly. I believe they should set up ethics advisory boards or internal teams to advise on what they should do, and then take that advice.”

–Kay Firth-Butterfield

I have interviewed Kay Firth-Butterfield, Head of Artificial Intelligence and member of the Executive Committee at the World Economic Forum. We talked about Artificial Intelligence (AI) and in particular, we discussed responsible AI,  trustworthy AI and AI ethics.

RVZ

Q1. You are the Head of Artificial Intelligence and a member of the Executive Committee at the World Economic Forum. What is your mission at the World Economic Forum? 

Kay Firth-Butterfield: We are committed to improving the state of the world. 

Q2. Could you summarize for us what are in your opinion the key aspects of the beneficial and challenging technical, economic and social changes arising from the use of AI? 

Kay Firth-Butterfield: The potential benefits of AI being used across government, business and society are huge. For example, AI can help find ways of educating the uneducated, giving healthcare to those without it and finding solutions to climate change. Embodied both in robots and in our computers, it can help keep the elderly in their homes and create adaptive energy plans for air conditioning so that we use less energy and help keep people safe. Apparently some 8800 people died of heat in the US last year but only around 450 from hurricanes. It also helps with cyber security and fighting corruption. On the other side, we only need to look at the fact that over 190 organisations have created AI principles, the EU is aiming to regulate the use of AI, and the OHCHR has called for a ban on AI which affects human rights, to know that there are serious problems with the way we use the tech, even when we are careful.

Q3. The idea of responsible AI is now mainstream. But why are companies lagging behind when it comes to operationalizing it in the business?

Kay Firth-Butterfield: I think they are worried about what regulations will come and the R&D which they might lose from entering the market too soon. Also, many companies don’t know enough about the reasons why they need AI. CEOs are not envisaging the future of the company with AI which, if available, is often left to a CTO. It is still hard to buy the right AI for you and know whether it is going to work in the way it is intended or leave an organisation with an adverse impact on its brand. Boards often don’t have technologists and so they cannot help the CEO think through the use of AI for good or ill. Finally, it is hard to find people with the right skills. I think this may be helped by remote working, when people don’t have to relocate to a country which is reluctant to issue visas.

Q4. What is trustworthy AI? 

Kay Firth-Butterfield: The design, development and use of AI tools which do more good for society than they do harm.

Q5. The Forum has developed a board toolkit to help board members operationalize AI ethics. What is it? Do you have any feedback on how useful it is in practice?

Kay Firth-Butterfield: It provides Boards with information which allows them to understand how their role changes when their company uses AI and therefore gives them the tools to develop their governance and other roles to advise on this complex topic. Many Boards have indicated that they have found it useful and it has been downloaded more than 50,000 times.

Q6. Let’s talk about standards for AI. Does it really make sense to standardize an AI system? What is your take on this?

Kay Firth-Butterfield: I have been working with the IEEE on standards for AI since 2015, and I am still the Vice-Chair. I think that we need to use all types of governance for AI, from norms to regulation, depending on risk. Standards provide us with an excellent tool in this regard.

Q7. There are some initiatives for Certification of AI. Who has the authority to define what a certification of AI is about? 

Kay Firth-Butterfield: At the moment there are many who are thinking about certification. There is no regulation and no way of being certified to certify! This needs to be done or there will be a proliferation of certifications and no one will be able to understand which is good and which is bad. Governments have a role here, for example Singapore’s work on certifying people to use their Model AI Governance Framework.

Q8. What kind of incentives are necessary in your opinion for helping companies to follow responsible AI practices? 

Kay Firth-Butterfield: I think that many companies need to understand that their customers are worried about the use of AI and then act accordingly. I believe they should set up ethics advisory boards or internal teams to advise on what they should do, and then take that advice. In our Responsible Use of Technology work we have considered this in detail.

Q9. Do you think that soft government mechanisms would be sufficient to regulate the use of AI or would it be better to have hard government mechanisms? 

Kay Firth-Butterfield: Both.

Q10. Assuming all goes well, what do you think a world with advanced AI would look like? 

Kay Firth-Butterfield: I think we have to decide what trade-offs of privacy we want to allow for humans to develop harnessing AI. I believe that it should be up to each of us, but sadly one person deciding to use surveillance via a doorbell surveils many. I believe that we will work with robots and AI so that we can do our jobs better. Our work on positive futures with AI is designed to help us better answer this question. Report out next month! Meanwhile here is an agenda.

…………………………………………………………

Kay Firth-Butterfield is a lawyer, professor, and author specializing in the intersection of business, policy, artificial intelligence, international relations, and AI ethics. 

Since 2017, she has been the Head of Artificial Intelligence and a member of the Executive Committee at the World Economic Forum and is one of the foremost experts in the world on the governance of AI. She is a barrister, former judge and professor, technologist and entrepreneur and vice-Chair of The IEEE Global Initiative for Ethical Considerations in Artificial Intelligence and Autonomous Systems. She was part of the group which met at Asilomar to create the Asilomar AI Ethical Principles, and is a member of the Polaris Council for the Government Accountability Office (USA), the Advisory Board for UNESCO International Research Centre on AI and AI4All.

She regularly speaks to international audiences addressing many aspects of the beneficial and challenging technical, economic and social changes arising from the use of AI.

Resources

  1. Empowering AI Leadership: An Oversight Toolkit for Boards of Directors. World Economic Forum.
  2. Ethics by Design: An organizational approach to responsible use of technology.  White Paper December 2020. World Economic Forum.
  3. A European approach to artificial intelligence, European Commission.
  4. The IEEE Global Initiative for Ethical Considerations in Artificial Intelligence and Autonomous Systems

Related Posts

On Digital Transformation and Ethics. Interview with Eberhard Schnebel. ODBMS Industry Watch. November 23, 2020

On the new Tortoise Global AI Index. Interview with Alexandra Mousavizadeh. ODBMS Industry Watch,  April 7, 2021

##

Aug 27 21

On Managing Innovation. Interview with Jack Levis

by Roberto V. Zicari

“Early on I was a leader who acted like an architect. I felt I needed to set direction and create plans and vision that my people could follow. I learned to be more of a caretaker. I set general direction and give my people the support and resources they need. I monitor progress and make adjustments as needed.” – Jack Levis

I have interviewed Jack Levis, Retired UPS Senior Director of Industrial Engineering. We talked about the main lessons learned in his long career at UPS. Very informative and full of wisdom.

RVZ

Q1. What are the main lessons you learned in your career of 42 years and 10 months at UPS?

Jack Levis: A career can go by in the blink of an eye….   A career is a journey, not a destination.  I enjoyed every day I worked at UPS.  Well, almost every day.  

I came to work to try and make my organization better.  That being said, I also placed a priority on people over projects.  If you take care of people, they will take care of everything else.  Generally, people can accomplish more than you think.

That being said, I also learned that change and impact are much more difficult than they appear. I have not found silver bullets.

But with the right people, with the right attitudes, and the right project, amazing things can happen.

After nearly 43 years, I think I accomplished more than I expected.  But when all is done, the awards, accolades, and successes are not what is remembered. 

It is the people.  I was fortunate to work with and for the best people.

Q2. What do you think are the essential ingredients to foster innovation and constant improvement?  

Jack Levis: First, understand the decisions that can be improved.  From there, work backward. 

  1. What information is needed to improve the decision?
  2. What tools are needed to provide the information?
  3. What data is needed to feed the tools?

As important as the technology, is making sure to understand and plan for deployment.  The best technology unused is worth zero.  Deployment needs to be planned way ahead of project completion. 

Too often, people think about building and deploying a tool.  We should be thinking about deploying impact.  There is a huge difference between the two and impact is what was promised.

Is innovation the great idea, or the execution of the idea?  Of course, you need both, but without the execution it’s  a moot point.

Therefore, have a focus on deployment and results!! 

Q3. You are quoted in an interview back in 2017 saying “Never assume you know the answers.” What is your take on this in retrospect? 

Jack Levis: This still holds true for me today with any project which could have significant impact.   

My motto is always, “if it were easy, it would have been done already”.  I look for what I don’t know.  What is the “gotcha” that keeps the tool from working and the impact from happening.

Often, the issue is in “hidden” business rules and “subjective” decisions.

For instance, an algorithm may have a function to reduce cost while meeting the stated business rules.  More often than not, there are unstated rules.  When these unstated rules are found late in the project, you can easily get into a game of “whack-a-mole”, adding rules one by one.

Similarly, sometimes rules are just guidelines. People deal with this better than computers. Often these turn out to be “subjective” decisions. Take, for instance, UPS’ famous “no left turn” policy. This is a guideline to not make unnecessary left turns. The hard part becomes defining “unnecessary”.

Finally, there is always the issue of data.  I have never been on a project where the data was already sufficient.  

So…  I go into a project thinking this will be harder than it seems, will have many unstated rules, and the data will not be sufficient.

Q4. You also mentioned that “Without understanding the importance of change management, new programs run the risk of becoming a “flavor of the month” “. Do you have any practical tips you can offer us on this? 

Jack Levis: People are more reluctant to change than you would expect.  They are invested in the old way of doing things.  I think this is human nature for any new innovation.

The key to change management is to get the field to “own” the new way of operating.

From my experience, I have learned to listen to what people talk about.  If a new tool is deployed and people talk about the same things they did before, the tool is a “flavor of the month.”  People will go back to the old way of doing things as soon as the spotlight is removed.

Therefore, I believe the way to get change to happen is to change conversations.  A great way to do this is with new metrics.  A new balanced scorecard that takes the new system into account.

A well thought out balanced scorecard that is linked to proper behavior and results can do wonders.  People are competitive and like being at the top of the ranking.  This can change conversations and therefore behavior.

With proper support from the top, a consistent message, continued focus, and training, the change will happen.  Especially if the stated goals are met.

Q5. You were the business owner and process designer for UPS’ Package Flow Technology suite of systems, which includes its award-winning delivery optimization, ORION (On Road Integrated Optimization and Navigation). These tools have been a breakthrough change for UPS, resulting in a reduction of 185 million miles driven each year and reducing costs by $350M to $400M annually. How did you manage to convince your management to support this transformation?  

Jack Levis: We knew that with such a large change as this, a nice presentation would not be enough.  Therefore, we built a prototype and tested it in real world operations.

When we went to the C-Suite, we showed them results, not just an idea.  Therefore, the system sold itself.

As with anything else, there was a healthy skepticism.  But because of the prototype, we could show them what we were thinking about.  We took the time to show almost every department head the system and concept.   

Senior management could see what we said made sense and that we could prove our enthusiasm.  A well thought out business case with facts to back it up can go a long way.

Selling this was not hard.  Achieving the results was difficult.

Q6. What are the secrets to building a strong enterprise data infrastructure?

Jack Levis: Start by understanding that existing data is rarely sufficient…..

From there, ensure the data is forward looking, not just historical transactions.  

Think in terms of a data model that describes a process and supports decisions.  Can the data answer what has happened, but also what should happen, and why?

When we built our first step of digital transformation, the first nine months of the project were nothing but data analysis. We needed to come up with the data model that could answer all questions and make the right decisions.

The mindset needs to be that data is as important as the product.

Q7. What are the elements (not necessarily technical) that play a key role when making decisions and managing projects?

Jack Levis: Running projects and programs is always a matter of determining priorities and managing risk.

Projects and functionality are evaluated based on benefit, cost, risk and dependencies. These items can be weighed to find the solutions that have the highest benefit with the lowest cost, risk and dependencies.
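The weighing described here can be sketched as a simple weighted score. This is only an illustration under loudly stated assumptions: the weights, the 0-10 scales and the candidate projects are all hypothetical, and nothing here is taken from how UPS actually evaluated projects.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    benefit: float       # higher is better
    cost: float          # lower is better
    risk: float          # lower is better
    dependencies: float  # lower is better


# Hypothetical weights; a real team would have to agree on these.
WEIGHTS = {"benefit": 1.0, "cost": 0.5, "risk": 0.8, "dependencies": 0.3}


def score(c):
    """Reward benefit and penalize cost, risk and dependencies."""
    return (WEIGHTS["benefit"] * c.benefit
            - WEIGHTS["cost"] * c.cost
            - WEIGHTS["risk"] * c.risk
            - WEIGHTS["dependencies"] * c.dependencies)


# Made-up candidates scored on a 0-10 scale.
candidates = [
    Candidate("route optimizer", benefit=9, cost=6, risk=5, dependencies=4),
    Candidate("dashboard refresh", benefit=4, cost=2, risk=1, dependencies=1),
    Candidate("data model rebuild", benefit=8, cost=7, risk=6, dependencies=2),
]

ranked = sorted(candidates, key=score, reverse=True)
for c in ranked:
    print(c.name, round(score(c), 2))
```

A linear score like this is only a starting point for discussion; the ranking changes as soon as the weights do, which is exactly why the weighing is a management decision and not a calculation.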

Once a project begins, continual risk management is essential. Managing risk means you are always looking ahead for what “might” happen. This is much different than resolving an issue which has already happened. It’s more productive to discuss tomorrow’s risks than yesterday’s problems.

Finally, and most important….  Communication, teamwork, and trust.

Projects are completed through people.  People want to work on good teams and trust their co-workers.  The best way to do this is through constant communication.

Q8. What makes a business future-ready? 

Jack Levis: It’s all about agility.  

A mature digital enterprise will have:

  1. High definition and forward looking data
  2. Front line technology to allow visualizing, interacting, and planning from the data
  3. Advanced analytics to optimize and assist in the decision making
  4. Strong leaders who understand change management

With these pieces in place, an organization can turn on a dime and adjust to changing conditions quickly.  

COVID has shown us this.  There are so many organizations that are failing customers because they still rely on human knowledge to do the job vs. digital operations.

Q9. What mistakes did you make in your career and what did you learn from them?

Jack Levis: Many of the things I mention here are because they were “blind spots” for me along the way.

I used to think more about building a tool rather than deploying an impact.  This led to good tools that were not used to their potential.

Similar to above, I didn’t understand the importance of change management.  I didn’t focus enough on deployment.

Finally, early on I was a leader who acted like an architect.  I felt I needed to set direction and create plans and vision that my people could follow.

I learned to be more of a caretaker.  I set general direction and give my people the support and resources they need.  I monitor progress and make adjustments as needed.

As I said earlier, given a chance people can accomplish much more than you think.

………………………………………………………………………….

Jack Levis, Retired UPS Senior Director of Industrial Engineering, was responsible for the development of operational technology solutions.  These solutions required advanced analytics to reengineer processes, streamline the business, and maximize productivity. Jack was the business owner and process designer for UPS’ Package Flow Technology suite of systems which includes its award-winning optimization, ORION (On Road Integrated Optimization and Navigation). These tools have been a breakthrough change for UPS, resulting in a reduction of 225 million miles driven each year.  ORION alone is providing significant operational benefits to UPS and its customers.  UPS estimates that ORION alone is reducing costs by $500M to $600M per year.

Having earned his Bachelor of Arts in psychology from California State University Northridge, Jack also holds a Master’s Certificate in Project Management from George Washington University. He is a fellow of the Institute for Operations Research and the Management Sciences (INFORMS), receiving their prestigious Kimball Medal and the President’s Award. Jack holds advisory council positions for multiple universities and associations, including the United States Census Bureau Scientific Advisory Committee.

Related Posts

Big Data at UPS. Interview with Jack Levis. ODBMS Industry Watch, August 1, 2017
10 Questions On Innovation to Alan Kay. ODBMS Industry Watch, April 5, 2006
On Innovation. Interview with Scott McNealy, ODBMS Industry Watch, July 2, 2018
10 Questions On Innovation to Philippe Kahn, ODBMS Industry Watch, February 5, 2006

##

Aug 13 21

On Time Series Databases. Interview with Ryan Betts

by Roberto V. Zicari

I have interviewed Ryan Betts, VP of Engineering at InfluxData. We talked about time series databases, InfluxDB and the InfluxData stack.

RVZ

“Time series databases have key architectural design properties that make them very different from other databases. These include time-stamped data storage and compression, data lifecycle management, data summarization, ability to handle large time-series-dependent scans of many records, and time-series-aware queries.” –Ryan Betts

Q1. What is time series data?

Ryan Betts: Time series data consists of measurements or events that are captured and analyzed, often in real time, to operate a service within an SLO, detect anomalies, or visualize changes and trends. Common time series applications include server metrics, application performance monitoring, network monitoring, and sensor data analytics and control loops. Metrics, events, traces and logs are examples of time series data.

Q2. What are the hard database requirements for time series applications?

Ryan Betts: Managing time series data requires high-performance ingest (time series data is often high-velocity, high-volume), real-time analytics for alerting and alarming, and the ability to perform historical analytics against the data that’s been collected. Additionally, many time series applications apply a lifecycle policy to the data collected — perhaps downsampling or aggregating raw data for historical use.  

With time series, it’s common to perform analytics queries over a substantial amount of data. Time series queries commonly include columnar scans, grouped and windowed aggregates, and lag calculations. This kind of workload is difficult to optimize in a distributed key value store. InfluxDB uses columnar database techniques to optimize for exactly these use cases, giving sub-second query times over swathes of data and supporting a rich analytics vocabulary.
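To make "grouped and windowed aggregates" and "lag calculations" concrete, here is a minimal sketch in plain Python. It is not how InfluxDB executes queries; the sample points, the 60-second window and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# (timestamp_seconds, host, cpu_percent): hypothetical sample points
points = [
    (0, "a", 10.0), (20, "a", 30.0), (65, "a", 50.0),
    (10, "b", 80.0), (70, "b", 60.0), (110, "b", 40.0),
]


def windowed_mean(points, window_s):
    """Group by (host, window start) and average the values in each group."""
    buckets = defaultdict(list)
    for ts, host, value in points:
        window_start = ts - ts % window_s  # align the timestamp to its window
        buckets[(host, window_start)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


def lag_diff(series):
    """Difference between each value and the previous one (a lag-1 calculation)."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


means = windowed_mean(points, 60)
host_a = [value for (host, _), value in means.items() if host == "a"]
print(means)
print(lag_diff(host_a))
```

The same windowed mean also serves as a simple form of downsampling: keeping one value per host per minute in place of every raw point.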

While time series data is typically structured, it often has dynamic properties that aren’t well-suited to strict schema enforcement. Time series databases often specify the structure of data but allow schema-on-write. Another way of saying this is that time series databases often support arbitrary dimension data to decorate the contents of the fact table. This allows developers to create new instrumentation or collect metrics from new sources without performing frequent schema migrations. Document databases and column-family stores similarly allow flexible schema in their own contexts. The motivation with time series is similar — optimizing for developer productivity.
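A small sketch may help show what "arbitrary dimension data" buys the developer. The `Point` and `TinyStore` classes below are hypothetical and are not InfluxDB's API; they only illustrate that a new tag can appear on write without any schema migration.

```python
from dataclasses import dataclass, field


@dataclass
class Point:
    measurement: str
    timestamp: int
    fields: dict                               # the measured values
    tags: dict = field(default_factory=dict)   # arbitrary dimension data


class TinyStore:
    """Hypothetical in-memory store; the set of tag keys grows as data arrives."""

    def __init__(self):
        self.points = []
        self.tag_keys = set()

    def write(self, point):
        self.tag_keys.update(point.tags)  # a new tag key needs no migration
        self.points.append(point)

    def where(self, **tags):
        """Return the points whose tags match all the given key/value pairs."""
        return [p for p in self.points
                if all(p.tags.get(k) == v for k, v in tags.items())]


store = TinyStore()
store.write(Point("cpu", 1, {"usage": 0.4}, {"host": "a"}))
# A new source adds a "region" tag that earlier points never had.
store.write(Point("cpu", 2, {"usage": 0.7}, {"host": "b", "region": "eu"}))

print(sorted(store.tag_keys))
print(len(store.where(region="eu")))
```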

In addition to high-performance ingest, non-trivial analytics queries, and flexible schema, TSDBs also need to bridge real-time analytics to real-time action. There’s little point doing real-time monitoring if you can’t also automate real-time responses. So time series databases, like other real-time analytics systems, need to provide the analytics function and the ability to tie into real-time operations. That means integrating automated alerting, alarming, and API invocations with the query analytics performed for monitoring. 
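As a rough illustration of tying analytics to action, the sketch below runs a threshold check and invokes a callback for each violation. The function `check_threshold`, the 0.90 threshold, and the sample points are assumptions; a real deployment would call an alerting service or API instead of appending to a list.

```python
def check_threshold(points, threshold, on_alert):
    """Invoke on_alert for every (timestamp, value) whose value exceeds threshold.

    Returns the number of alerts raised.
    """
    fired = 0
    for ts, value in points:
        if value > threshold:
            on_alert(ts, value)
            fired += 1
    return fired


alerts = []
count = check_threshold(
    [(1, 0.40), (2, 0.95), (3, 0.97)],  # hypothetical (time, disk usage) points
    threshold=0.90,
    on_alert=lambda ts, value: alerts.append((ts, value)),
)
print(count, alerts)
```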

Q3. How do you manage the massive volumes and countless sources of time-stamped data produced by sensors, applications and infrastructures?

Ryan Betts: The InfluxData stack is optimized for both regular (metrics often gathered from software or hardware sensors) and irregular time series data (events driven either by users or external events), which is a significant differentiator from other solutions like Graphite, RRD, OpenTSDB, or Prometheus. Many services and time series databases support only the regular time series metrics use case. 

InfluxDB lets users collect from multiple and diverse sources, store, query, process and visualize raw high-precision data in addition to the aggregated and downsampled data. This makes InfluxDB a viable choice for applications in science and sensors that require storing raw data.

At the storage level, InfluxDB organizes data into a columnar format and applies various compression algorithms, typically reducing storage to a fraction of the raw uncompressed size. Time series applications are “append-mostly”: the majority of arriving data is appended. Late-arriving data and deletes occur with some frequency, but writes primarily result in appending to the fact table. The database uses a log-structured merge (LSM) tree architecture to meet these requirements. Deletes are recorded first as tombstones and are later removed through LSM compaction.
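The tombstone-then-compact idea can be sketched with a toy store. This is a deliberately simplified model, assuming a single in-memory log; a real LSM tree uses a memtable and multiple sorted on-disk files, and nothing here reflects InfluxDB's actual storage engine.

```python
TOMBSTONE = object()  # sentinel marking a deleted key


class TinyLSM:
    """Toy log-structured store: writes append to a log, compaction rewrites it."""

    def __init__(self):
        self.log = []  # append-only list of (key, value) records

    def put(self, key, value):
        self.log.append((key, value))

    def delete(self, key):
        # A delete is just another append: a tombstone record.
        self.log.append((key, TOMBSTONE))

    def get(self, key):
        # The most recent record for a key wins.
        for k, v in reversed(self.log):
            if k == key:
                return None if v is TOMBSTONE else v
        return None

    def compact(self):
        # Keep only the latest record per key, then drop tombstoned keys.
        latest = {}
        for k, v in self.log:
            latest[k] = v
        self.log = [(k, v) for k, v in latest.items() if v is not TOMBSTONE]


store = TinyLSM()
store.put(("cpu", 1), 23.0)
store.put(("cpu", 2), 25.0)
store.delete(("cpu", 1))
size_before = len(store.log)
store.compact()
size_after = len(store.log)
print(size_before, size_after)
```

The delete costs no more than a write, and the space is reclaimed later, in bulk, when compaction runs.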

Q4. Can you give us some time series examples?

Ryan Betts: Time series data, also referred to as time-stamped data, is a sequence of data points indexed in time order. Time-stamped data is data collected at different points in time.

These data points typically consist of successive measurements made from the same source over a time interval and are used to track change over time.

Weather records, step trackers, heart rate monitors, all are time series data. If you look at the stock exchange, a time series tracks the movement of data points, such as a security’s price over a specified period of time with data points recorded at regular intervals.

InfluxDB has a line protocol for sending time series data which takes the following form:

<measurement name>,<tag set> <field set> <timestamp>

The measurement name is a string, the tag set is a collection of key/value pairs where all values are strings, and the field set is a collection of key/value pairs where the values can be int64, float64, bool, or string. The measurement name and tag sets are kept in an inverted index which makes lookups for specific series very fast.

For example, if we have CPU metrics:

cpu,host=serverA,region=uswest idle=23,user=42,system=12 1549063516

Timestamps in InfluxDB can be by second, millisecond, microsecond, or nanosecond precision. The micro and nanosecond scales make InfluxDB a good choice for use cases in finance and scientific computing where other solutions would be excluded. Compression is variable depending on the level of precision the user needs.
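The line format described above can be assembled with a few lines of Python. This is a sketch rather than an official client: the function `to_line` is hypothetical, its escaping is simplified, and the `i` suffix for integer fields and double quotes for string fields follow the line protocol conventions as I understand them.

```python
def _escape(text, chars):
    """Backslash-escape each character in `chars` within `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integer fields carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, timestamp):
    """Build one line: <measurement>,<tag set> <field set> <timestamp>."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # tags emitted in key order
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(val) for key, val in fields.items()
    )
    return f"{head} {body} {timestamp}"


line = to_line(
    "cpu",
    {"host": "serverA", "region": "uswest"},
    {"idle": 23.0, "user": 42.0, "system": 12.0},
    1549063516,
)
print(line)
```

Tags are plain strings that identify the series, while fields carry the typed measurements, which is why only field values need type-specific formatting.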

Q5. The fact that time series data is ordered makes it unique in the data space because it often displays serial dependence. What does it mean in practice?

Ryan Betts: Serial dependence occurs when the value of a datapoint at one time is statistically dependent on another datapoint at another time.

Though there are no events that exist outside of time, there are events where time isn’t relevant. Time series data isn’t simply about things that happen in chronological order — it’s about events whose value increases when you add time as an axis. Time series data sometimes exists at high levels of granularity, as frequently as microseconds or even nanoseconds. With time series data, change over time is everything.
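One common way to quantify serial dependence is the sample autocorrelation at a given lag. The sketch below is plain Python with made-up series; the function name `autocorrelation` is an assumption and is not part of any product mentioned in the interview.

```python
from statistics import mean


def autocorrelation(values, lag=1):
    """Sample autocorrelation of `values` at the given lag."""
    mu = mean(values)
    denom = sum((v - mu) ** 2 for v in values)
    num = sum(
        (values[i] - mu) * (values[i - lag] - mu) for i in range(lag, len(values))
    )
    return num / denom


trend = [1, 2, 3, 4, 5, 6, 7, 8]       # each point sits close to its predecessor
zigzag = [1, -1, 1, -1, 1, -1, 1, -1]  # each point flips its predecessor

print(autocorrelation(trend), autocorrelation(zigzag))
```

A value near zero would suggest that knowing the previous point tells you little about the next one; values far from zero, in either direction, indicate serial dependence.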

Q6. How is time series data understood and used?

Ryan Betts: Time series data is gathered, stored, visualized and analyzed for various purposes across various domains:

  1. In data mining, pattern recognition and machine learning, time series analysis is used for clustering, classification, query by content, anomaly detection and forecasting.
  2. In signal processing, control engineering and communication engineering, time series data is used for signal detection and estimation.
  3. In statistics, econometrics, quantitative finance, seismology, meteorology, and geophysics, time series analysis is used for forecasting.

Time series data can be visualized in different types of charts to facilitate insight extraction, trend analysis, and anomaly detection. Time series data is used in time series analysis (historical or real-time) and time series forecasting to detect and predict patterns — essentially looking at change over time. 

Q7. You also handle two other kinds of data, namely cross-section and panel data. What are these? How do you handle them?

Ryan Betts: Cross-sectional data is a collection of observations (behavior) for multiple entities at a single point in time. For example: Max Temperature, Humidity and Wind (all three behaviors) in New York City, SFO, Boston, Chicago (multiple entities) on 1/1/2015 (single instance).

Panel data is usually called cross-sectional time series data, as it is a combination of both time series data and cross-sectional data (i.e., collection of observations for multiple subjects at multiple instances).

This collection of data can be combined in a single series, or you can use the Flux language to combine and review this data to gather insights.
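The relationship between the three kinds of data can be shown with a tiny example. The temperature values below are made up for illustration, and the helper names are hypothetical; this is plain Python rather than Flux.

```python
# Hypothetical panel: (city, date, max temperature in Fahrenheit).
panel = [
    ("New York City", "2015-01-01", 39),
    ("Boston", "2015-01-01", 33),
    ("Chicago", "2015-01-01", 32),
    ("New York City", "2015-01-02", 42),
    ("Boston", "2015-01-02", 35),
    ("Chicago", "2015-01-02", 34),
]


def cross_section(rows, date):
    """All entities at a single point in time."""
    return {city: temp for city, d, temp in rows if d == date}


def time_series(rows, city):
    """One entity across all points in time, in date order."""
    return sorted((d, temp) for c, d, temp in rows if c == city)


print(cross_section(panel, "2015-01-01"))
print(time_series(panel, "Boston"))
```

Fixing the date yields a cross-section, fixing the entity yields a time series, and the full table is the panel.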

Q8. There are several time series databases available in the market. What makes InfluxDB time series database unique?

Ryan Betts: When doing a comparison, the entire InfluxDB Platform should be taken into account. There are multiple types of databases that get brought up for comparison. Mostly, these are distributed databases like Cassandra or more time-series-focused databases like Graphite or RRD. When comparing InfluxDB with Cassandra or HBase, there are some stark differences. First, those databases require a significant investment in developer time and code to recreate the functionality provided out of the box by InfluxDB. Developers also have to create an API to write and query their new service.

Developers using Cassandra or HBase need to write tools for data collection, introduce a real-time processing system and write code for monitoring and alerting. Finally, they’ll need to write a visualization engine to display the time series data to the user. While some of these tasks are handled with other time series databases, there are a few key differences between the other solutions and InfluxDB. First, other time series solutions like Graphite or OpenTSDB are designed with only regular time series data in mind and don’t have the ability to store raw high-precision data and downsample it on the fly.

With other time series databases, the developer must summarize their data before putting it into the database, whereas InfluxDB lets the developer seamlessly transition from raw time series data into summarizations.

InfluxDB also has key advantages for developers over Amazon Timestream. Among them:

  • InfluxData is first and foremost an open source company. It is committed to sharing ideas and information openly, collaborating on solutions and providing full transparency to drive innovation.
  • Hybrid cloud and on-premises support. Every business has specific functionalities, and a hybrid cloud system offers the flexibility to choose services that best fit their needs, whether to support GDPR regulatory requirements or teams that are spread across multiple providers.

Q9. What distinguishes the time series workload?

Ryan Betts: Time series databases have key architectural design properties that make them very different from other databases. These include time-stamped data storage and compression, data lifecycle management, data summarization, ability to handle large time-series-dependent scans of many records, and time-series-aware queries.

For example, with a time series database it is common to request a summary of data over a large time period. This requires going over a range of data points to perform some computation, such as the percentage increase of a metric this month over the same period in the last six months, summarized by month. This kind of workload is very difficult to optimize for with a distributed key value store. TSDBs are optimized for exactly this use case, giving millisecond-level query times over months of data.
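A simplified version of that kind of query, summarizing by month and comparing each month with the previous one, might look like the sketch below. It is plain Python over an in-memory list; the function names and the sample values are assumptions, and a TSDB would run the equivalent scan and aggregation inside the storage engine.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def utc_ts(year, month, day):
    """Unix seconds for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def monthly_means(points):
    """Group (unix_seconds, value) points by UTC calendar month and average them."""
    by_month = defaultdict(list)
    for ts, value in points:
        month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        by_month[month].append(value)
    return {month: mean(vals) for month, vals in sorted(by_month.items())}


def percent_change(summary):
    """Percent change of each month's mean relative to the previous month."""
    months = list(summary)
    return {
        cur: 100.0 * (summary[cur] - summary[prev]) / summary[prev]
        for prev, cur in zip(months, months[1:])
    }


# Hypothetical metric readings.
points = [
    (utc_ts(2021, 1, 10), 40.0),
    (utc_ts(2021, 1, 20), 60.0),
    (utc_ts(2021, 2, 5), 55.0),
    (utc_ts(2021, 2, 25), 65.0),
]
summary = monthly_means(points)
print(summary, percent_change(summary))
```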

Q10. Let’s talk about integrations. Software services don’t work alone. Suppose an application relies on Amazon Web Services, or monitors Kubernetes with Grafana or deploys applications through Docker, how easy is it to integrate them with InfluxDB?

Ryan Betts: InfluxData provides tools and services that help you integrate your favorite systems across the spectrum of IT offerings, from applications to services, databases to containers. We currently offer 200+ Telegraf plugins to allow these seamless integrations. Developers using the InfluxDB platform build their applications with less effort, less code, and less configuration with the use of a set of powerful APIs and tools. InfluxDB client libraries are language-specific tools that integrate with the InfluxDB API and can be used to write data into InfluxDB as well as query the stored data.

………………………………………………..

Ryan Betts is VP of Engineering at InfluxData. Ryan has been building high performance infrastructure software for over twenty years. Prior to InfluxData, Ryan was the second employee and CTO at VoltDB. Before VoltDB, he spent time building SOA security and core networking products. Ryan holds a B.S. in Mathematics from Worcester Polytechnic Institute and an MBA from Babson College.

Resources

influxdata/influxdb: Scalable datastore for metrics – GitHub

Introduction to Time Series Databases | Getting Started [1 of 7] YouTube

Related Posts

COVID-19 Tracking Using Telegraf and InfluxDB Dashboards

On Big Data Benchmarking. Q&A with Richard Stevens

The 2021 AI Index report (HAI Stanford University)


##

May 27 21

Why AI/Data Science Projects Fail. Interview with Joyce Weiner

by Roberto V. Zicari

“The most dangerous pitfall is when you solve the wrong problem.” –Joyce Weiner

I have interviewed Joyce Weiner, Principal AI Engineer at Intel Corporation. She recently wrote the book Why AI/Data Science Projects Fail.

RVZ

Q1. In your book you start by saying that 87% of Artificial Intelligence/Big Data projects don’t make it into production, meaning that most projects are never deployed. Is this still the case?

Joyce Weiner: I can only provide the anecdotal evidence that it is still a topic of conversation at conferences and an area of concern. A quick search doesn’t provide me with any updated statistics. The most recent data point appears to be the Venture Beat reference (VB Staff, 2019). Back in 2019, Gartner predicted that “Through 2022, only 20% of analytic insights will deliver business outcomes.” (White, 2019)

Q2. What are the common pitfalls?

Joyce Weiner: I specifically address the common pitfalls that are in the control of the people working on the project. Of course, there can be other external factors that will impact a project’s success. But just focusing on what you can control and change:

  1. The scope of the project is too big
  2. The project scope increased in size as the project progressed (scope creep)
  3. The model couldn’t be explained
  4. The model was too complex
  5. The project solved the wrong problem

Q3. You mention five pitfalls. Which of the five is the most frequent, and which is the most dangerous for a project?

Joyce Weiner: Of the five pitfalls, scope creep is the one I have seen most often in my experience. It’s an easy trap to fall into: you want to build the best solution, and there is a tendency to add features as they come to mind without assessing how much value they add or whether it makes sense to add them right now. The most dangerous pitfall is when you solve the wrong problem. In that case, not only have you spent time and effort on a solution, but once you realize that you solved the wrong problem, you also need to go back and redo the project to target the correct problem. Clearly, that can be demoralizing for the team working on the project, not to mention the potential business impact from the delay in delivering a solution.

Q4. You suggest five methods to avoid such pitfalls. What are they?

Joyce Weiner: The five methods I discuss in the book to avoid the pitfalls mentioned previously are:

  1. Ask questions – this addresses the project scope as well as providing information to decide on the amount of explainability required, and most importantly, ensures you are solving the correct problem.
  2. Get alignment – working with the project stakeholders and end users, starting as early as the project definition and continuing throughout the project, addresses problems with project scope and makes sure you are on track to solve the correct problem.
  3. Keep it simple – this addresses model explainability and model complexity.
  4. Leverage explainability – obviously directly related to model explainability, and addresses the pitfall of solving the wrong problem.
  5. Have the conversation – continually discussing the project, expected deliverables, and sharing mock-ups and prototypes with your end users as you build the project addresses all five of the project pitfalls.

Q5. How do you apply and measure the effectiveness of these methods in practice?

Joyce Weiner: Well, the most immediate measurement is if you were able to deploy a solution into production. As a project progresses, you can measure things that will help you stay on track. For example, having a project charter to document and communicate your plans becomes a reference point as you build a project so that you recognize scope creep. A project charter is also useful when having conversations with project stakeholders to document alignment on deliverables.

Q6. Throughout your book you use the term “data science projects” as an all-encompassing term that includes Artificial Intelligence (AI) and Big Data projects. Don’t you think that this is a limitation of your approach? Big Data projects might have different requirements and challenges than AI projects.

Joyce Weiner: Well, that is true: Big Data projects do have additional challenges, especially around the data pipeline. The five pitfalls still apply, and in my experience those are the biggest challenges to getting a project into deployment.

Q7. In your book you recommend as part of the project charter to document the expected return on investment for the project. You write that assessing the business value for your project will help get resources and funding. What metrics do you suggest for this?

Joyce Weiner: I propose several metrics in my book, which depend on the type of project you are delivering. For example, a common data science project is performing data analysis. Deliverables for this type of project are root cause determination, problem solving support, and problem identification. Metrics are productivity, which can be measured as time saved, time to decision which is how long it takes to gather the information needed to make a decision, decision quality, and risk reduction due to improved information or consistency in the information used to make decisions.

Q8. You also write that in acquiring data, there are two cases. One, when the data are available already either in internal systems or from external sources, and two, when you don’t have the data. How do you ensure the quality (and for example the absence of Bias) of the existing data?

Joyce Weiner: The easiest way to ensure you have high quality data is to automate data collection as much as possible. If you rely on people to provide information, make it easy for them to enter the data. I have found that if you require a lot of fields for data entry, people tend to not fill things in, or they don’t fill things in completely. If you can collect the data from a source other than a human, say ingesting a log file from a program, your data quality is much higher. Checking for data quality by examining the data set before beginning on any model building is an important step. You can see if there are a lot of empty fields or gaps, or one-word responses in free text fields – things that call the quality of the data into question. You also get a sense of how much data cleaning you’ll need to do.
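The kind of pre-modeling check described above, looking for empty fields and one-word free-text responses, can be sketched in a few lines. The function names and the sample records are hypothetical and only illustrate the idea.

```python
def is_blank(value):
    """True for missing values and for strings that are empty or whitespace."""
    return value is None or (isinstance(value, str) and not value.strip())


def empty_field_rates(records, fields):
    """Fraction of records in which each field is missing or blank."""
    return {
        field: sum(is_blank(rec.get(field)) for rec in records) / len(records)
        for field in fields
    }


def one_word_rate(records, field):
    """Among non-blank responses, the fraction that are a single word."""
    answers = [rec[field] for rec in records if not is_blank(rec.get(field))]
    if not answers:
        return 0.0
    return sum(len(str(a).split()) == 1 for a in answers) / len(answers)


# Hypothetical manually entered records.
records = [
    {"tool": "A1", "comment": "ok"},
    {"tool": "", "comment": "seal replaced after leak"},
    {"tool": "B7", "comment": None},
    {"tool": "B7"},
]
print(empty_field_rates(records, ["tool", "comment"]))
print(one_word_rate(records, "comment"))
```

High rates on either measure give an early sense of how much cleaning the data set will need.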

Bias is something that you need to be aware of. For example, if your data set is made up solely of failing samples, you have no information on what makes something good or bad; you can only examine the bad. Building a model from those data that “predicts” good samples would be wrong. I’ve found that thinking through the purpose of the data, and doing it as early as possible in the process, is key. Although it’s tempting to say, “given these data, what can I do?”, it’s better to start from a problem statement and then ensure you are collecting the proper data related to the problem to avoid having a biased data set.

Q9. What do you do if you do not have any data?

Joyce Weiner: Well, it makes it very difficult to do a data science project without any data. The first thing to do is to identify what data you would want if you could have them. Then, develop a plan for collecting those data. That might be building a survey or that might mean adding sensors or other instruments to collect data.

Q10. How do you know when an AI/Big Data Project is ready for deployment?

Joyce Weiner: In my experience a project is ready for deployment when you have aligned with the end user and have completed all the items needed to deliver the solution they want. This includes things like a maintenance plan, metrics to monitor the solution, and documentation of the solution.

Q11. Can you predict if a project will fail after deployment?

Joyce Weiner: If a project doesn’t start well, meaning if you aren’t thinking about deployment as you build the solution, it doesn’t bode well for the project overall. Without a deployment plan, and without planning for things like maintainability as you build the project, then it is likely the project will fail after deployment. And by this I include a dashboard which doesn’t get used, or a model that stops working and can’t be fixed by the current team.

Q12. What measures do you suggest to monitor a BigData/AI project after it is deployed?

Joyce Weiner: The simplest measure is usage. If the solution is a report, are users accessing it? If it’s a model, also track predicted values against actual measurements. In the book, I share a tool called a SIPOC, or supplier-input-process-output-customer, which helps identify the metrics the customer cares about for a project. Some examples are timeliness, quality, and support level agreements.
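Comparing predicted values with actual measurements can be as simple as tracking an error metric over time. The sketch below uses mean absolute error, which is one reasonable choice among many; the function names, the tolerance, and the numbers are assumptions, not something taken from the book.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predictions and the later-measured actuals."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def needs_review(predicted, actual, tolerance):
    """Flag the model for review when its error exceeds the tolerance."""
    return mean_absolute_error(predicted, actual) > tolerance


predicted = [10.0, 12.0, 9.0]  # hypothetical model outputs
actual = [11.0, 12.0, 7.0]     # hypothetical measurements collected later
print(mean_absolute_error(predicted, actual))
print(needs_review(predicted, actual, tolerance=0.5))
```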

Q13. In your book you did not address the societal and ethical implications of using AI. Why?

Joyce Weiner: I didn’t address the societal and ethical implications of AI for two reasons. First, it isn’t my area of expertise. Second, it is such a big topic that it warrants its own book.

……………………………………


Joyce Weiner is a Principal AI Engineer at Intel Corporation. Her area of technical expertise is data science and using data to drive efficiency. Joyce is a black belt in Lean Six Sigma. She has a BS in Physics from Rensselaer Polytechnic Institute, and an MS in Optical Sciences from the University of Arizona. She lives with her husband outside Phoenix, Arizona.

References

VB Staff. (2019, July 19). Why do 87% of data science projects never make it into production? Retrieved from VentureBeat: https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/

White, A. (2019, Jan 3). Our Top Data and Analytics Predicts for 2019. Retrieved from Gartner: https://blogs.gartner.com/andrew_white/2019/01/03/our-top-data-and-analytics-predicts-for-2019/


ISBN-13: 978-1636390383
ISBN-10: 1636390382
Publisher: Morgan & Claypool (December 18, 2020)

 

May 5 21

On Amazon DocumentDB. Interview with Barry Morris

by Roberto V. Zicari

“We built DocumentDB to implement the Apache 2.0 open source MongoDB APIs, specifically by emulating the responses that a MongoDB client expects from a MongoDB server. We don’t support 100 percent of the APIs today, but we do support the vast majority that customers actually use. We continue to work back from customers and support additional APIs that customers ask for.” — Barry Morris.

I have interviewed Barry Morris, GM of ElastiCache, Timestream and DocumentDB at AWS. We talked about DocumentDB.

RVZ.

Q1. AWS has many database services now. Why DocumentDB? Why did you build it?

Barry Morris: At AWS we believe customers should choose the right tool for the right job, and we don’t believe there’s a one size fits all database given the variety and scale of applications out there. Customers using our purpose-built databases don’t have to compromise on the functionality, performance, or scale of their workloads because they have a tool that is expressly designed for the purpose at hand. In the case of Amazon DocumentDB (with MongoDB compatibility) we offer a fast, scalable, highly available, and fully managed document database service that is purpose-built to store and query JSON.

We built Amazon DocumentDB because customers kept asking us for a flexible database service that could scale document workloads with ease. Amazon DocumentDB has made it simple for these customers to store, query, and index data in the same flexible JSON format that is generated in their applications, so it is highly intuitive for their developers. And it achieves this expressive document query support while also maintaining the high availability, performance, and durability required for modern enterprise applications in the cloud. Similar to our other AWS purpose-built database services, Amazon DocumentDB is fully managed, so customers can scale their databases with clicks in the console rather than executing a planning exercise that takes weeks.

Finally, because many of our customers with document database needs are already enthusiastic about and familiar with the MongoDB APIs, we designed Amazon DocumentDB to implement the Apache 2.0 open source MongoDB APIs. This allows customers to use their existing MongoDB drivers and tools with Amazon DocumentDB, and to migrate directly from their self-managed MongoDB databases to Amazon DocumentDB. It also gives them the freedom to migrate data in and out of DocumentDB without fear of lock-in.

Q2. Who is using DocumentDB and for what?

Barry Morris: Amazon DocumentDB is being used today by a wide variety of customers, from longstanding global enterprises like Samsung and Capital One, to digital natives like Rappi and Zulily, to financial organizations like FINRA. In addition, several products that Amazon customers use, such as the Fulfillment by Amazon (FBA) experience on Amazon.com, are powered by Amazon DocumentDB. We have customers in virtually every industry, from financial services to retail, from gaming to manufacturing, from media and entertainment to publishing, and more.

Many of our customers are software engineering teams who don’t want to deal with the “undifferentiated heavy lifting” of database administration, such as hardware provisioning, patching, setup, and configuration. These organizations would rather allocate their valuable engineering talent to building core application functionality, rather than deploying and managing MongoDB clusters. One of our customers, Plume, saved themselves the cost of “three to five approximately $150,000 Silicon Valley salaries” which both offset the managed service cost and allowed their team to focus on their core mission to deliver a superior wireless internet experience. Further, DocumentDB allows Plume to scale much more than their previous solution, with one of their clouds handling as many as 50,000 API calls per minute. You can read the full case study here.

The customer use cases are wide and many, given that document databases offer both flexible schemas and extensive query capabilities. Some of the more traditional use cases for document databases include catalogs, user profiles, and content management systems; and with the scale that AWS and Amazon DocumentDB provide, we are seeing customers deploy document databases for a much wider range of internet-scale use cases, including critical customer-facing e-commerce applications and production telemetry.

Q3. What has been the customer response?

Barry Morris: As with all AWS services, we work very closely with DocumentDB customers to ensure we are building a service that works backward from their needs. To date, the feedback we get is that customers are thrilled by DocumentDB’s ease of scaling, its fully managed capabilities, its natural integration with other AWS offerings, its durability and general enterprise-readiness, and its straightforward API compatibility with MongoDB. Of course, we are always working to add capabilities and features that are highly requested. For example, we just improved our MongoDB compatibility by adding support for frequently requested APIs such as renameCollection, $natural, and $indexOfArray. In the coming months, we also plan to release one of our most-requested features, Global Clusters, for customers with cross-region disaster recovery and data locality requirements. We also continue to bolster our MongoDB compatibility by adding support for the APIs that customers use the most.

Q4. What are the main design features of Amazon DocumentDB?

Barry Morris: Amazon DocumentDB has been built from the ground up with a cloud native architecture designed for scaling JSON workloads with ease. An essential design feature of DocumentDB is that it decouples compute and storage, allowing each to scale independently. Because storage and compute are separate, customers can add replicas without putting additional load on the primary. This allows you to easily scale out read capacity to millions of requests per second by adding up to 15 low latency read replicas across three AWS Availability Zones (AZs) in minutes. DocumentDB’s distributed, fault-tolerant, self-healing storage system auto-scales storage up to 64 TB per database cluster without the need for sharding, and without any impact or downtime to a customer’s application.

As I mentioned before, DocumentDB is built to be enterprise-ready. It provides strict network isolation with Amazon Virtual Private Cloud (VPC). All data is encrypted at rest with AWS Key Management Service (KMS) and encryption in transit is provided with Transport Layer Security (TLS). DocumentDB has compliance readiness with a wide range of industry standards, and automatically and continuously monitors and backs up to Amazon S3, which is highly durable.

Q5. When would you suggest using DocumentDB vs another purpose-built database?

Barry Morris: At its core, DocumentDB is designed to store, index, and query rich and complex JSON documents with high availability and scalability. You can retrieve documents based on nested field values, join data across collections, and perform aggregation queries. So if you need schema flexibility and the ability to index and query rich structured and semi-structured documents, DocumentDB is a great choice. This is particularly true if you have JSON document workloads that are mission critical for your organization. A DocumentDB cluster provides 99.99% availability, can handle tens of thousands of writes per second and millions of reads per second, and supports up to 64 TiB of data. Finally, since DocumentDB supports MongoDB workloads and is compatible with the MongoDB API, it is a logical choice for MongoDB users who are looking to easily migrate to a fully managed database solution. Every use case is unique, and it is often a good idea to engage an AWS solution architect (SA) if you have questions about selecting the right database for your next application.

Q6. What are the key advantages of DocumentDB vs managing your own cluster?

Barry Morris: For many customers, fully managed is all about scale. We scale your database at the click of a button, saving you nights and weekends of scaling clusters manually. Customers don’t have to worry about provisioning hardware, running the service, configuring for high availability, or dealing with patching and durability. These concerns are shifted to AWS, so our customers can focus on their applications and innovate on behalf of their customers. Something as simple as backup and restore can be a drag on production. With DocumentDB, backup is on by default.

Cost is also a big concern when managing your own clusters. This can include the cost of labor resources, hardware investments, vendor software solutions, support costs, and more. Cost becomes very transparent with DocumentDB, as it offers pay-as-you-go pricing with per second instance billing. You don’t have to worry about planning for future growth, because DocumentDB scales with your business.

Q7. Tell me about “MongoDB compatibility” – what does that really mean in practice?

Barry Morris: That’s a great question and one we get a lot from customers. We built DocumentDB to implement the Apache 2.0 open source MongoDB APIs, specifically by emulating the responses that a MongoDB client expects from a MongoDB server. We don’t support 100 percent of the APIs today, but we do support the vast majority that customers actually use. We continue to work back from customers and support additional APIs that customers ask for. Because we offer MongoDB API compatibility, it’s straightforward to migrate from the MongoDB databases you’re managing on premises or in EC2 today to DocumentDB. Updating the application is as easy as changing the database endpoint to the new Amazon DocumentDB cluster.
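The "change the endpoint" step can be pictured as swapping the host in a MongoDB connection URI while the rest of the application stays the same. The sketch below only builds the URI strings with the standard library. Both hostnames and the credentials are placeholders, and the `tls` option is an assumption about how a given cluster is configured.

```python
from urllib.parse import quote_plus, urlencode


def mongo_uri(host, user, password, port=27017, **options):
    """Build a MongoDB connection URI; only `host` differs between deployments."""
    auth = f"{quote_plus(user)}:{quote_plus(password)}"
    query = f"/?{urlencode(options)}" if options else "/"
    return f"mongodb://{auth}@{host}:{port}{query}"


# Placeholder hostnames and credentials, for illustration only.
self_managed = mongo_uri("mongo.internal.example.com", "app", "s3cret")
managed = mongo_uri("docdb.example.com", "app", "s3cret", tls="true")
print(self_managed)
print(managed)
```

An existing MongoDB driver would then be handed the new URI in place of the old one. Whether further options are needed depends on the cluster and the driver in use.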

Q8. Let’s hear about some exciting customer momentum. Can you please share some customer stories?

Barry Morris: We have a lot of them! Customers including BBC, Capital One, Dow Jones, FINRA, Samsung, and The Washington Post have shared their success stories with us. Recently, we’ve done some deeper-dive case studies with customers in a range of industries.

For example, Zulily presented their solution at AWS re:Invent 2020. The popular online retailer is using Amazon DocumentDB along with Amazon Kinesis Data Analytics to power its “suggested searches” feature. In this solution, Kinesis Data Analytics filters relevant events from clickstream analytics when a Zulily customer requests a search, a Lambda function performs a lookup for brands and categories relevant to those events, and the resulting enriched events — which populate the suggested search — are stored in DocumentDB. The feature has been a hit, with more than 75% of Zulily customers using suggested searches when they search the online store.

A customer story that is particularly compelling given recent events is Rappi. Rappi is a successful Colombian delivery app startup that operates in nine Latin American countries. The company had been rearchitecting their monolithic application into a more flexible, microservices-driven architecture to help it scale as it grew. As part of this modernization effort, the startup selected DocumentDB as a fully managed, purpose-built JSON database service to replace its self-managed MongoDB clusters, which were becoming unwieldy to manage at scale. When Covid-19 hit, the company faced an unprecedented surge in orders and deliveries. DocumentDB enabled them to handle the surge because, as a highly scalable service, it operated as normal despite the change in volume. Overall, Rappi decreased management and operational overhead by more than 50% using Amazon DocumentDB.

A final one I will mention is Asahi Shimbun, which is one of Japan’s oldest and largest-circulated newspapers. The company overhauled its digital app last year using AWS and selected Amazon DocumentDB as their content master database to store their articles. Since modernizing, Asahi Shimbun has seen a 30% reduction in monthly operation costs for extracting past articles and a 20% improvement in frequency of use for the app. This is one of many examples that showcase how essential AWS is for industries like publishing, retail, and banking that are evolving with new business models in the cloud.

You can peruse these and many other customer case studies in full on our website.

Q9. Anything else you wish to add?

Barry Morris: Over the last decade, JSON/document-based workloads have become one of the primary alternatives to relational approaches, for a wide range of applications with requirements for flexible data management. We expect this trend to keep growing, particularly with cloud-native applications, and we’re excited to offer DocumentDB as a tool in the toolkit of modern builders leveraging JSON. It’s been great to see DocumentDB support the needs not only of customers who are migrating their existing MongoDB workloads to the cloud, but also the builders who are creating modern applications and choosing DocumentDB as the right “purpose-built database” for their needs.

For anyone interested in learning more and getting hands-on with DocumentDB, we have a number of things coming up that may be of interest. We will be hosting two DocumentDB Focus Days, which are virtual workshops on best practices, in May and June. You can learn more and sign up on the registration page.  Finally, we have an ongoing Twitch series where our solution architects (SAs) dive deeper on DocumentDB functionality, which you can learn more about on the website. Our DocumentDB product detail page is the best place to start for a general overview of the service and steps to get started, and you can refer to the documentation for an in-depth developer guide.

………………………….

Barry Morris, GM ElastiCache, Timestream and DocumentDB. As General Manager of ElastiCache, Timestream and DocumentDB, Barry manages a number of businesses in the AWS database portfolio.  He is focused on delivering value to AWS customers through trusted data management services, with a relentless commitment to database innovation.

Prior to joining AWS in 2020, his career includes over 20 years as the CEO of international technology companies, both private and public, including Undo.io, NuoDB, StreamBase, Headway, and IONA Technologies. Barry has also had leadership roles in PROTEK, Metrica, Lotus Development and DEC. 

Born in South Africa, Barry lived in England and Ireland before moving to Boston. He holds a Bachelor’s Degree (BA) in engineering from Oxford University and an Honorary Doctorate in Business Administration (DBA) from the IMCA.

Resources

– Get Started with Amazon DocumentDB

Related Posts

– From SQL to NoSQL. Interview with Carlos Fernández. by Roberto V. Zicari. ODBMS Industry Watch, April 30, 2021

Follow us on Twitter: @odbmsorg

Apr 30 21

From SQL to NoSQL. Interview with Carlos Fernández

by Roberto V. Zicari

“We like to say that we have the biggest database on companies and sole proprietors in Spain. We handle 7 million national economic agents, and the database undergoes more than 150,000 daily information updates. We have been active since 1992, so our historic file is massive. The database as a whole exceeds 40 Terabytes.” –Carlos Fernández

I have interviewed Carlos Fernández, Deputy General Manager at INFORMA Dun & Bradstreet. We talked about their use of the LeanXcale database.

RVZ

Q1. Could you describe in a few words what Informa Dun & Bradstreet is and what its figures are?

Carlos Fernández: Informa D&B is the leading business information services company for customer and supplier acquisitions, analyses and management. We maintain this leadership in the three markets in which we compete: Spain, Portugal and Colombia.

We like to say that we have the biggest database on companies and sole proprietors in Spain. We handle 7 million national economic agents, and the database undergoes more than 150,000 daily information updates. We have been active since 1992, so our historic file is massive. The database as a whole exceeds 40 Terabytes.

To maintain and update this massive database, we invest 12 million euros every year in data and data handling procedures and systems, and we have 130 data specialists that take care of every single piece of information that we load into the database. Data quality, accuracy and timeliness as well as the coherence between different sources are essential for us.

Q2. I understand that Informa D&B has begun a profound update of its data architecture in order to continue being a market leader for another 10 years. What does the update consist of?

Carlos Fernández: We really began updating when gigabytes were insufficient for our needs. Now we see that terabytes will follow the same path. Petabytes are the future, and we need to be prepared for it. We usually say that when you need to travel to another continent, you need an airplane, not a car.

What does this mean in practical terms? Our customers are used to online responses to their needs. However, these needs have become more complex and require greater data depth.

If you are able to store hundreds of terabytes, use them very quickly and use complex analytic models to easily find the answer to your question, then you are in good shape.

To fulfill these requirements, a Data Lake orientation is really a must, and solutions like LeanXcale will become key factors in our new architectural approach.

Q3. You mentioned that you have found a new database manager, LeanXcale, to address the challenges for your data platform. What kind of database manager were you using before and why are you replacing it?

Carlos Fernández: INFORMA was, and still is, an “Oracle” company. Having said that, the more we began to move into a Data Lake design, the more new solutions and new names came into play. Mongo, Cassandra, Spark …

So, having come from an SQL-oriented environment featuring many lines of code, we wondered if we could fulfill our new requirements with the old technology. The answer to that query is a clear NO. Can we rewrite INFORMA as a whole? The answer is again NO. Can we meet our new requirements by increasing our computing capacity? Once more, the answer is NO.

We needed to be smart and find a solution that could bring positive outcomes in an affordable technical environment.

Q4. According to you, one of the main improvements has been the acceleration of the process through leveraging the interfaces of LeanXcale with NoSQL and SQL. Can you elaborate on how it helped you?

Carlos Fernández: As I mentioned before, we have quite challenging business and product performance requirements. On the other hand, business rules are also complex and difficult to rewrite for different environments.

Can we solve our issues without a huge investment in expensive servers? Can we also accommodate these requirements in a scalable fashion?

LeanXcale and its NoSQL and SQL interfaces were the perfect match for our needs.

Q5. What are the technical and business benefits of having a linear scaling database such as LeanXcale?

Carlos Fernández: We have many customers. They range from the biggest Spanish companies to small businesses and sole proprietors. They have completely different needs, but, at the same time, they share many requirements, with the main one being immediate response time.

Of course, the amount of data and model complexity involved in generating a response can vary quite a lot, depending on the size of the company and its portfolio.

Only by being able to accommodate such demands with a scalable solution can we provide the required services under a viable cost structure.

Q6. How was your experience with LeanXcale as a provider?

Carlos Fernández: For us, this has been quite an experience. From the very beginning, the LeanXcale team acted as though they worked for INFORMA.

We started with a POC, and it was not an easy one. We had the feeling that we had the best parts of the company involved in the project. Well, it was more than a feeling, since that really was the case.

The key factor, however, was the team’s knowledge, that is, the depth of their technical approach, the extent to which they understood our needs and their ability to reshape many aspects to make our requirements a reality.

Q7. You said that LeanXcale has a high impact on reducing total cost of ownership. Could you provide us with figures comparing it to the previous scenario?

Carlos Fernández: LeanXcale has cut our processing time by a factor of more than 72. The standard LeanXcale licensing and support price means savings of around 85%. In our case, we have maximized these savings by signing an unlimited License Agreement for the next five years.

Additionally, this improved performance reduces the infrastructure used in our hybrid cloud by the same proportion, a factor of 72.

However, these savings are less crucial than the operational risk reduction and the enablement of new services. Being ready to react to any unexpected event quickly makes our business more reliable. New services will allow us to maintain our market leadership for the next decade.

Q8. How will this new technology affect the services offered to the customer?

Carlos Fernández: I think that we can consider two periods of time in the answer.

Right now, we are capable of improving the features of our current product range. We can deliver updated external databases faster and more frequently and offer a better customer experience in many areas. We can provide more data and more complex solutions to a wider range of customers.

For the future, we are discovering new ways to design new products and services. When you break down barriers, new ideas come up quite easily. Our marketing team is really excited about the new capabilities we will have. I am sure that we will shortly see many new things coming from us.

Q9. Anything else you wish to add?

Carlos Fernández: INFORMA D&B is a company that has put innovation at the top of its strategy. We never stop and will find new opportunities through using LeanXcale. We are very pleased and very sure that we will be a market leader for many years to come!

——————————————

Carlos Fernández holds a Superior Degree in Physics and an MBA from the “Instituto de Empresa” in Madrid. His professional career has included stints at companies such as Saint Gobain, Indra, Reuters and Fedea.

At the present time, he is Deputy General Manager at INFORMA and a member of the board of the XBRL Spanish Jurisdiction. In addition, he is a member of the Alcobendas City Council’s Open Data Advisory Board. This entity is firmly committed to continue advancing and publishing information in a reusable format to generate social and economic value.

Furthermore, he is a former member of various boards, including the boards of ASNEF Logalty, ASFAC Logalty and CTI.

He is a former member of GREFIS (Financial Information Services Group of Experts) and a current member of XBRL CRAS (Credit Risk Services), for which he is Vice President of the Technical Working Group. He is also a former member of the Information Technologies Advisory Council (CATI) and the AMETIC Association (Multi-Sector Partnership of Electronics, Communications Technology, Telecommunications and Digital Content Companies).

Resources

YouTube: LeanXcale’s success story on Informa D&B by Carlos Fernández Iñigo, CTO at Informa D&B

Related Posts

On Digital Transformation, Big Data, Advanced Analytics, AI for the Financial Sector. Interview with Kerem Tomak, by Roberto V. Zicari, ODBMS Industry Watch. July 8, 2019

Apr 21 21

On C++ Debugging. Interview with Greg Law

by Roberto V. Zicari

“Like it or not, debugging is part of programming. There is a lot of research and cool technology about preventing bugs (programming language features or design decisions that make certain bugs impossible) or catching bugs very early (through static or dynamic analysis or better testing), and all this is of course laudable and good stuff. But I’ve often been struck by how little attention is placed on making it easier to fix those bugs when they inevitably do happen.” — Greg Law

Q1: You are a prolific speaker at C++ conferences and podcasts. In your experience, who is still using C++?

Greg Law: C++ is used widely and its use is growing. I see a lot of C++ usage in Data Management, Networking, Electronic Design Automation (EDA), Aerospace, Games, Finance, etc.

It’s probably true that use of some other languages – particularly JavaScript and Python – is growing even faster, but those languages are weak where C++ is strong and vice versa. Go is growing a lot and Rust is getting a lot of attention right now and has some very attractive properties. 10-15 years ago, it felt almost like programming languages were “done” but these days, we’re seeing a lot of innovation both in terms of new or newish languages, and development of older languages. Even plain old C is seeing a bit of a resurgence. We are going to continue living in a multi-language world; I expect C++ to remain an important language for a long while yet.

Q2: In my interview with Bjarne Stroustrup last year, he spoke about the challenge of designing C++ in the face of contradictory demands of making the language simpler, whilst adding new functionality and without breaking people’s code. What are your thoughts on this?

Greg Law: I totally agree. I think all engineering is about two things – minimising mistakes and making tradeoffs (i.e. judgements). Mistakes might be a miscalculation when designing a bridge so that it won’t stand up or an off-by-one error in your program – those are clearly undesirable, we don’t want those. A tradeoff might be between how expensive the bridge is to build and how long it will last, or how long the code takes to write and how fast it runs.

But tradeoffs are relevant when it comes to reducing errors too – what price should we pay to avoid errors in our programs? How much extra time are we prepared to spend writing or testing it to get the bugs out? How far do we go tracking down those flaky 1-in-a-thousand failures in the test-suite? Are we going to sacrifice runtime performance by writing it in a higher-level and less error-prone language? Alternatively, we could choose to make that super-clever optimisation about which it’s hard to be confident it is correct today and even harder to be sure it will remain correct as the code around it changes; but is the runtime performance gain worth it, given the uncertainty that has been introduced? It’s counterintuitive, but actually there is an optimal bugginess for any program – we inevitably trade off cost of implementation and performance against potential bugs.

It’s probably fair to say however that most programs have more bugs than is optimal! I think it’s also true that human nature means we tend to under-invest in dealing with the bugs early, particularly flaky tests. We always feel “this week is particularly busy, I’ll park that and take a look next week when I’ll have a bit more time”; and of course next week turns out to be just as bad as this week.

Q3: I understand Undo helps software engineering teams with debugging complex C/C++ code bases. What is the situation with debugging C/C++? What are you seeing on the ground?

Greg Law: Like it or not, debugging is part of programming. There is a lot of research and cool technology about preventing bugs (programming language features or design decisions that make certain bugs impossible) or catching bugs very early (through static or dynamic analysis or better testing), and all this is of course laudable and good stuff. But I’ve often been struck by how little attention is placed on making it easier to fix those bugs when they inevitably do happen. The situation is not unlike medicine in that prevention is better than cure, and the earlier the diagnosis the better; but no matter what we do, we will always need cure (unlike medicine we have the balance wrong the other way round – in medicine we spend way too much on cure vs prevention!).

It’s all about tradeoffs again. All else being equal, we’d ensure there are no bugs in the first place; but all else never is equal, and how high a price can we afford on prevention? And actually if you make diagnosis and fixing cheaper, that further reduces how much you need to spend on prevention.

The harsh reality is that close to none of the software out there today is truly understood by anyone. Humans just aren’t very good at writing code, and economic pressure and other factors mean we add and fix tests until our fear of delivering late outweighs our fear of bugs. This is compounded as code ages; people move on from the project, bugs get fixed by adding a quick hack, further increasing the spaghettification. Like frogs in boiling water, we’ve kind of become so used to it that we don’t notice how awful it is any more!

People routinely just disable flaky failing tests because they can’t root-cause them. Over a third of production failures can be traced back directly or indirectly to a test that was failing and was ignored.

Q4: You have designed a time travel debugger for C/C++. What is it for?

Greg Law: Debugging is really answering one question: “what happened?”. I had certain expectations for what my code was going to do and all I know is that reality diverged from those expectations. Traditional debuggers are of limited help here – they don’t tell you what happened, they just tell you what is happening right now. You hit a breakpoint, you can look around and see what state everything is in, and either it looks all good or you can see something wrong. If it’s good, set another breakpoint and continue. If it’s bad… well, now you want to know what happened, how it became bad. The odds of breaking just at the right point and stepping your code through the badness are pretty long. So you run again, and again. If you’re lucky, vaguely the same thing happens each time, so you can home in on it; if not, well… you’re in trouble.

With a time travel debugger like UDB, it’s totally different – you see some piece of state is bad, you can just go backwards to find out why. Watchpoints (aka data breakpoints) are super powerful here – you can watch the bad piece of data and run backwards and have the debugger take you straight to the line of code that last modified it. We have customers who have been trying to fix something for literally years who with a couple of watch + reverse-continue operations had it nailed in an hour.

Time travel debuggers are really powerful for any bug where a decent amount of time passes between the bug itself and the symptoms (assertion failure, segmentation fault, bad results produced). They are particularly useful when there is any kind of non-determinism in the program – when the bug only occurs one time in a thousand and/or every time you run the program it fails at a different point or in a different way. Most race conditions are examples of this; so are many memory or state corruption bugs. It can also help to diagnose complex memory leaks. Most leak detectors or static analysis help with the trivial issues (say you returned an error and forgot to add a free) but not the hard ones (for example when you have a reference counting bug and so the reference never hits zero and the resources don’t get cleaned up).
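To make the reference-counting case concrete, here is a minimal sketch of how such a leak can arise. It is written in Python for brevity and only mimics a hand-rolled C/C++ scheme; the `Resource` class and `process` function are hypothetical and not taken from any real codebase. The bug is deliberate: one code path acquires a reference and returns without releasing it.

```python
class Resource:
    """Hand-rolled reference counting, mimicking a manual C/C++ scheme."""

    def __init__(self, name: str):
        self.name = name
        self.refs = 1          # the creator holds the first reference
        self.freed = False

    def acquire(self) -> None:
        self.refs += 1

    def release(self) -> None:
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # stands in for freeing memory, closing a handle, etc.


def process(res: Resource, payload: str) -> bool:
    res.acquire()
    if not payload:
        # BUG (deliberate): the early return skips the matching release(),
        # so the count can never drop back to zero.
        return False
    res.release()
    return True


res = Resource("buffer")
process(res, "data")   # balanced acquire/release
process(res, "")       # takes the buggy path
res.release()          # the creator drops its own reference
print(res.refs, res.freed)
```

The symptom (a count stuck above zero at shutdown) shows up long after the unbalanced `acquire()`, which is why watching the counter and stepping backwards through its history is a natural fit for this kind of bug.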

This new white paper provides more insight into what kind of bugs time travel debugging helps with *. It’s not uncommon for software engineers to spend half their time debugging, so it’s a must-read for anyone who wants to increase development team productivity.

By the way, Time Travel Debugging is also sometimes known as Replay Debugging or Reverse Debugging.

Q5: Since you say it lets you see what happened, could it help with code exploration too?

Greg Law: Funny you say that. This is a use case it wasn’t initially designed for, but many engineers are using it to explore unfamiliar codebases they didn’t write. They use it to observe program behaviour by navigating forwards and backwards in the program’s execution history, examine registers to find the address of an object etc. They say there’s a huge productivity benefit in being able to go backwards and forwards over the same section of code until you fully understand what it does. Especially as you’re trying to understand a certain piece of code, and there are often millions of lines you don’t care about right now, it’s easy to get lost. When that happens you can go straight back to where you were and continue exploring.

Debugging is about answering “what did the code do” (ref. cpp.chat podcast on setting a breakpoint in the past **); but there are other activities that involve asking that same question. As I say, most code out there is not really understood by anyone.  

Q6: What are your tips on how to diagnose and debug complex C++ programs?

Greg Law: The hard part about debugging is figuring out the root cause. Usually, once you’ve identified what’s wrong, the fix is quite simple. We once had a bug that sank literally months of engineering time to root cause, and the fix was a single character – that’s extreme but the effect it’s illustrating is very common.

Identifying the problem is an exercise in figuring out what the code really did as opposed to what you expected. Somewhere reality has diverged from your expectations – and that point of divergence is your bug. If you’re lucky, the effects manifest soon after the bug – maybe a NULL pointer is dereferenced and you needed a check for NULL right before it. But more often that pointer should never be NULL, the problem is earlier.

The answer to this is multi-pronged:

1. Liberal use of assertions to find problems as close to their root cause as possible. I reckon that 50% of assert fails are just bogus assertions, which is annoying but cheap to fix because the problem is at the very line of code that you notice. The other 50% will save you a lot of time.

2. If you see something not right, do not sweep it under the carpet. This is sometimes referred to as ‘smelling smoke’. Maybe it’s nothing, but you better go and look and see if there’s a fire. When you’re smelling smoke, you’re getting close to the root cause. If you ignore it, chances are that whatever the underlying cause of the weirdness is, it will come back and bite you in a way that gives you much less of a clue as to what’s wrong, and it’ll take you a lot longer to fix it. Likewise don’t paper over the cracks – if you don’t understand how that pointer can be NULL, don’t just put a check for NULL at the point the segv happened.

This most often manifests itself in people ignoring flaky test failures. 82% of software companies report having failing tests that were not investigated that went on to cause production failures *** (the other 18% are probably lying!). Working in this way requires discipline – following that smell of smoke or fixing that flaky test that you know isn’t your fault will be a distraction from your proximate goal. But when something is not right, or not understood, ignoring it now is going to cost you a lot of time in the long run.

3. Provide a way to know what your code is really doing. The trendy term is observability. This can be good old printf or some more fancy logging. An emerging technique is Software Failure Replay, which is related to Time-Travel Debugging. Here you record the program execution (a failed process), such that a debugger can be pointed at the execution history and you can go back to any line of code that executed and see full program state. This is like the ultimate observability. Discovering where reality diverged from your expectations becomes trivial.
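The first tip, asserting close to the root cause, can be sketched as follows. The example is in Python for brevity, and the function names and data are hypothetical; the same pattern applies to `assert` in C and C++. The point is that the check sits where bad data would enter the program, not where it eventually causes a failure.

```python
def register_user(users: dict, user_id: int, email: str) -> None:
    # Assert at the point where bad data would enter, not where it is later used.
    assert user_id not in users, f"duplicate user id {user_id}"
    assert email, "email must not be empty"
    users[user_id] = email


def email_domain(users: dict, user_id: int) -> str:
    # Without the assertions above, an empty email would only fail here,
    # far from the code that stored it.
    return users[user_id].split("@")[1]


users = {}
register_user(users, 1, "ada@example.org")
print(email_domain(users, 1))

try:
    register_user(users, 2, "")   # the bad value is caught where it is introduced
except AssertionError as exc:
    print("caught early:", exc)
```

Note that Python drops `assert` statements when run with `-O`, much as C and C++ compile out `assert` when `NDEBUG` is defined, so assertions of this kind are a debugging aid and not a substitute for input validation.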

————————————-

Dr Greg Law is the founder of Undo, the leading Software Failure Replay platform provider. Greg had 20 years’ experience in the software industry prior to founding Undo and has held development and management roles at companies including Solarflare and the pioneering British computer firm Acorn. Greg holds a PhD from City University, London, and is a regular speaker at CppCon, ACCU, QCon, and DBTest.

Resources

* White Paper: Increase Development Productivity with Time Travel Debugging

** cpp.chat podcast – Setting a Breakpoint in the Past

*** Freeform Dynamics Analyst Report – Optimizing the software supplier and customer relationship

Related Posts

Thirty Years C++. Interview with Bjarne Stroustrup. by Roberto V. Zicari. ODBMS Industry Watch. July 23, 2020

 

Apr 7 21

On the new Tortoise Global AI Index. Interview with Alexandra Mousavizadeh.

by Roberto V. Zicari

“I think the conversation is really about a shift of where funding is coming from. Governments are spending far less than the big tech platforms. What this tells us about who owns the direction of travel of AI is fascinating. Are we now in a position where the power that the public sector was able to deploy in the past is massively outgunned by private companies and their R&D budgets?” — Alexandra Mousavizadeh.

This is my follow-up interview with Alexandra Mousavizadeh, Partner at Tortoise Media. We talked about the new version of the Global AI Index.

RVZ

Q1. In December 2020, Tortoise Media launched the Global AI Index to benchmark nations on their level of investment, innovation and implementation of artificial intelligence. Since then, your team has been working on expanding the Index. What are the main results of this new Index?

Alexandra Mousavizadeh: The most striking result is China’s rapid improvement. Although the gap between the top three is significant (the US is still streets ahead of China, and the UK lags far behind in third place) across our 143 indicators, China has made gains. That rise is mostly due to a serious boost in research: yet another Chinese university joined the Times’ list of top 100 computer science universities; the total number of citations from high-achieving Chinese computer science academics jumped by 67 per cent over the course of the year; Chinese academic AI papers accepted by the IEEE – a body which sets AI standards and also publishes a number of influential AI journals – now outnumber those by US academics by a factor of seven; and China overtook the US in terms of AI patents granted around two years ago and has been pulling further ahead ever since.

China is also pulling ahead in the roll-out of supercomputers, with almost twice as many supercomputers as the US, demonstrating its growing threat to the US’ AI supremacy.

The UK has slipped in some key metrics, and its lead over its closest competitors has narrowed. Although a new AI strategy has now been announced, it’s not been published yet. We have also seen British slippage across several different key parts of the framework: universities, supercomputing, research, patents and diversity of specialists.

Q2. What has been added to the Global AI Index of this year?

Alexandra Mousavizadeh: We’ve added Armenia, Bahrain, Chile, Colombia, Greece, Slovakia, Slovenia and Vietnam: each of these countries has recently published a strategic approach to artificial intelligence on a national level, and therefore ‘qualified’ for assessment under The Global AI Index framework. That brings the number of countries assessed to 62 overall, up from 54 last year.

We’ve also developed all-new national AI dashboards: policy makers can monitor their national AI activity across all 143 indicators in real-time. These dashboards, the first of their kind, can also simulate the impact of policies via the target-setting feature, which calculates hypothetical ranks and scores based on chosen policy targets.

Q3. What new metrics did you introduce and why?

Alexandra Mousavizadeh: We’ve added a range of new metrics to deepen our measurements across many of the pillars of the index. In Talent, we have incorporated data provided by Coursera, showing the level of enrollment and activity on online learning courses specific to ‘artificial intelligence’ and ‘machine learning’. This data-set fills in gaps in countries like India, China and Russia – where our other metrics were not as comprehensive – by acting as a proxy for the level of online learning taking place.

We’ve also refined some of our existing metrics to increase the accuracy of our data. We’ve replaced the Open Data Barometer data-set, which was outdated in many respects, with measures from the OECD OURdata Index to better reflect the level of open data use and suitability. The new source is much more recent and relevant. We’ve also made our measures of 5G implementation more granular, reflecting the actual level of supported networks in a given country. This is a crucial leading indicator for the capacity to adopt artificial intelligence more widely in a given country,  so this change offers a lot more clarity in the Infrastructure pillar of the index.

In our forthcoming update for the index in May, we’ll be adding more indicators on diversity issues. These will complement a component of the index that deals with regulation and ethics.

Q4. Who are the biggest risers on the Index?

Alexandra Mousavizadeh: Israel has surged up the rankings from 12th to 5th place. It improved its talent rank, with more R package downloads, an increase in Stack Overflow questions and answers, and a rise in GitHub commits. These are important indicators because they speak to developments in a country’s coding community beyond the formal education sector. It also still has the highest number of startups as a proportion of the population – 3 AI startups for every 100,000 people (compared with 5 startups for every million people in the US). It’s an impressive feat for a small country, but its rise is driven in part by the number of proportional, or intensity-based, metrics in the index (which favour small countries), and partly by changes in our methodology that more accurately capture the number of developers and other specialists on social media.
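Because the two startup figures are quoted on different bases (per 100,000 people and per million people), it helps to put them on a common scale before comparing them. The short sketch below does only that arithmetic, using the figures quoted in the answer; the helper name `per_million` is an illustrative assumption and is not part of the Index methodology.

```python
def per_million(count: float, population: float) -> float:
    """Normalise a count to a rate per one million people."""
    return count * 1_000_000 / population


israel = per_million(3, 100_000)    # 3 AI startups per 100,000 people
us = per_million(5, 1_000_000)      # 5 AI startups per 1,000,000 people
print(israel, us, israel / us)
```

On a per-million basis the quoted figures work out to 30 for Israel against 5 for the US, a sixfold difference in startup density.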

Also of note is Finland, which has procured a pre-exascale supercomputer and substantially increased its coding activity. Use of Python, R and commits on GitHub have grown, as well as our measures of GitHub Stars and Stack Overflow Questions/Answers. This is a notable result of Finland’s ongoing focus on skills development, driven by both government strategy and an excellent ecosystem, including the Finnish AI Accelerator, the Tampere AI Hub and the AI Academy at the University of Turku. That complements exciting tech startups, including Rovio, Supercell, and CRF Health.

The Netherlands is the biggest riser: helped by a slow-down in some of the countries that previously ranked above it (Japan), it has accelerated its coding and development activities and has high scores on the proportional Talent measures, i.e. the number of data scientists per capita.

Q5. How has the pandemic influenced the global development of AI in the world? Do you have any insights to share on this?

Alexandra Mousavizadeh: It is difficult to untangle the exact effect of COVID on the global development of AI – a lot of the data is yet to come in. One area that has definitely suffered is start-ups. This time last year, the UK had 529 AI startups listed on Crunchbase, but it’s now dropped to 338. Other countries have seen similar collapses in startup numbers. There are some counterintuitive results: the 62 countries in our Index attracted similar overall levels of private funding for AI as in 2019. Several countries have seen a fall in investment, but these don’t necessarily match those that have been worst affected by the pandemic. Both the US and China, for instance, have seen a similar drop-off in funding, despite the US having a much worse pandemic overall.

The pandemic has also created challenges to which some countries have responded with AI. Coronavirus has become a focal point for Israel’s AI entrepreneurs, with several Israeli companies emerging as front-runners in areas like diagnostics, disease management and monitoring systems. Vocalis Health, for example, was launched by the defence ministry and is aiming to diagnose Covid effectively based on people’s speaking patterns.

Q6. The lack of transparency is a limiting factor for the effectiveness of the Index. For example in the case of Russia, much of its AI spending may be going to military purposes – and you can’t track it.  What is your take on this?

Alexandra Mousavizadeh: It’s true that there is AI spending and research that we can’t track, especially in countries that are less transparent, like Russia. We only use one proprietary data source for the Index – the Crunchbase API –  and the vast majority of the rest of our information is open source. Government spending on clandestine AI activity represents a small proportion of funding and progress in the AI space. Even if a country is spending quite large sums opaquely on covert AI, that spending is siloed, and often won’t contribute to a country’s overall progress in AI.

Q7. Isn’t this also the case for most of the other countries, such as China, and even the USA?  They do not necessarily reveal their use of AI for military purposes.

 Alexandra Mousavizadeh: I think the conversation is really about a shift of where funding is coming from. Governments are spending far less than the big tech platforms. What this tells us about who owns the direction of travel of AI is fascinating. Are we now in a position where the power that the public sector was able to deploy in the past is massively outgunned by private companies and their R&D budgets? Amazon spends roughly ten times as much on R&D and infrastructure as DARPA’s total budget ($36bn and $3.5bn in 2019 respectively). The UK’s new research agency, ARIA, is set to have just £800m over the course of the next parliament.

Another related issue is that of selective publication. It’s well within a company’s rights to not release research that it carries out – but when that research has the potential to create massive public goods, it’s concerning that decisions are made in private by unaccountable tech companies.

Q9. How would you characterize the current geo-politics of artificial intelligence?

Alexandra Mousavizadeh: The countries that get on top of AI will accrue enormous benefits very quickly – not just in efficiency gains and cost reductions, but in transformative technologies that will increase almost every aspect of their global competitiveness. Although a lot of research that we’re currently seeing is business, not state-led, states can move rapidly if they have to.

Conventional wisdom says it’s a competitive environment, with some saying that we’re seeing a rise in AI nationalism. But the different gains and losses made by nations in different factors and metrics on the index show a complex picture of states investing in different areas. The World Economic Forum’s ‘Framework for Developing a National Artificial Intelligence Strategy’ highlights the collaborative role that many governments are aiming to play: co-designing, rather than merely responding to, technological change across multiple sectors. And many of the factors we track don’t stick to one country – talent crosses borders, and GitHub is global. A lot of the most interesting developments are iterated with open-source technology. So in many ways, the geo-politics of AI can be one of mutual benefit rather than a zero-sum environment.

It’s more complicated than any one story; but one narrative that we’re definitely seeing is two superpowers, the US and China, establishing themselves as dominant in the space. Then there are a number of specialist smaller states that could have a significant role to play in terms of standard-setting, and specialist AI in areas of national comparative advantage – take the UK and medtech for example. Finally, there are countries that want to be involved but aren’t currently in a place where their strategy (or level of investment) matches their ambitions. Those countries need to get serious, quickly.

AI is an accelerant – we run the risk of seeing clusters of AI excellence that exacerbate divides within and between nations, compounding existing inequalities and leaving those without skills and capital behind.

Q10. Carissa Veliz mentioned that “to ensure ethical behaviour around AI, certain behaviours should be banned by law” . Do you plan to include policy regulations in the future version of the Index?

Alexandra Mousavizadeh: The dashboards we developed have a set of implicit policy recommendations sitting behind the indicators; governments can see how their rankings would improve by using these features. But with regard to ethics, we spent hundreds of hours with the team discussing how regulation and ethics could feed into the index and concluded that it needed an in-depth examination that warranted its own investigation. We are now doing that work. The May update to the index will contain a section on policy regulation, so stay tuned.

Having said this, there are some metrics in the Index which do address ethical considerations – such as the diversity of researchers in STEM subjects. This section will also be expanded in May.

Qx. Anything else you wish to add?

Alexandra Mousavizadeh: We love hearing from our readers at Tortoise; the point is that we include them as part of the conversation. Our AI Sensemaker newsletter contains our and their thinking, keeping you posted on the latest developments in AI, and comes out once a fortnight. You can sign up here.  To get more involved, you can also apply to join our AI Network: it’s a global community of experts, policy makers, and business leaders, who take part in monthly round tables. These cutting-edge conversations set the pace for all things AI.

————————-

Alexandra Mousavizadeh is a Partner at Tortoise Media, running the Intelligence team, which develops indices and data analytics. She is the creator of the recently released Responsibility100 Index and the new Global AI Index. She has 20 years’ experience in the ratings and index business and has worked extensively across the Middle East and Africa. Previously, she directed the expansion of the Legatum Institute’s flagship publication, The Prosperity Index, and all its bespoke metrics-based analysis and policy design for governments. Prior roles include CEO of ARC Ratings, a global emerging markets-based ratings agency; Sovereign Analyst for Moody’s covering Africa; and head of Country Risk Management, EMEA, Morgan Stanley.

Resources

The Global AI Index

The Global AI Index Methodology Report

The Tortoise Global AI Summit: the Readout. Thursday 10 December 2020

Related Posts

On The Global AI Index. Interview with Alexandra Mousavizadeh. ODBMS Industry Watch, by Roberto V. Zicari on January 18, 2020

Follow us on Twitter: @odbmsorg

##

Mar 18 21

On the Challenges Facing Financial Institutions. Interview with Joe Lichtenberg

by Roberto V. Zicari

“There are three factors C-suite executives need to consider when addressing operational resilience: the need to make better business decisions faster; improved automation and the elimination of manual processes; and the ability to respond to unexpected volume and valuation volatility.” –Joe Lichtenberg.

I have interviewed Joe Lichtenberg, responsible for product and industry marketing for data platform software at InterSystems.

RVZ

Q1. What are the main challenges financial institutions are facing right now?

Joe Lichtenberg: As financial services organizations are pushed to rapidly adapt due to the pandemic, they also want to gain a competitive edge, deliver more value to customers, reduce risk, and respond more quickly to the needs of distributed businesses. To not only stand out, but ultimately survive, financial services organizations have relied on their digital capabilities. For instance, many have adapted faster than anticipated and found ways to supplement traditional face-to-face customer service. As the volume of complex data grows and the need to use data for decision-making accelerates, it is becoming more difficult to reach their business goals and deliver differentiated service to customers at a faster rate.

Q2. What do you suggest to C-suite executives that could help them re-evaluate their operational resilience in light of increasing volumes and volatility, and the shift to a remote working environment (especially due to the COVID-19 crisis)?

Joe Lichtenberg: There are three factors C-suite executives need to consider when addressing operational resilience: the need to make better business decisions faster; improved automation and the elimination of manual processes; and the ability to respond to unexpected volume and valuation volatility. Executives need to prioritize their organizations’ ability to access and process a single representation of accurate, consistent, real-time and trusted data. The volatility and uncertainty fueled by the pandemic pushed organizations to rely on the vast amounts of data available to them to properly bolster resilience and adaptability.

From scenario planning to modeling enterprise risk and liquidity, regulatory compliance, and wealth management, access to accurate and current data can enable organizations to make smarter business decisions faster. Organizations need to streamline and accelerate operations by eliminating manual processes where possible and automating processes. Not only will this help increase speed and agility, but it will also reduce the delays and errors associated with manual processes. Finally, executives must look to ensure they have sufficient headroom, processing capabilities, and systems in place to foster agility and reliability and to respond to unexpected volatility.

Q3. What are the key challenges they face to keep pace with the ongoing market dynamics?

Joe Lichtenberg: The more data sources organizations have, the more complex their practices become. As data grows, so does the prevalence of data silos, making access to a single, trusted, and usable representation of the data challenging. Additionally, analytics are more difficult to perform with disorganized data, causing results to be less accurate, especially with regard to visibility, decision support, risk, compliance, and reporting. This issue is extremely important as organizations perform advanced analytics (e.g. machine learning), where access to large sets of clean, healthy data is required in order to build models that deliver accurate results.

Q4. Do you believe that capital markets are paying the price for delaying core investments into their data architectures?

Joe Lichtenberg: Established capital markets organizations have delayed some of their investments in data architectures for a variety of reasons, and that move has ultimately kept operational costs in check, as large changes could drastically disrupt workflows and set them back further in the short term. These organizations typically have well-established core infrastructures in place that have served them well. Over the years, these infrastructures have been expanded, which makes introducing significant changes a complex process. However, the combination of unprecedented volatility, rising customer expectations, and competition from niche financial technology companies that are providing new services is straining the limits of these systems and pushing established firms to modernize faster, using microservices, APIs, and AI. In fact, in some cases financial organizations are outsourcing non-core capabilities to FinTechs. The FinTechs, not burdened by legacy infrastructure, are able to innovate quickly but may not have the breadth, depth, or resilience of the established firms. As capital markets firms modernize their data architecture, replacing these systems can lead to greater downtime that can slow and stall modernization efforts. Implementing a data fabric enables organizations to modernize without costly rip-and-replace methods and empowers them to address siloed legacy applications while existing systems remain in place.

Q5. What is a “data fabric”?

Joe Lichtenberg: A data fabric is a reference architecture that provides the capabilities required to discover, connect, integrate, transform, analyze, manage, and utilize enterprise data assets. It enables the business to meet its myriad of business goals faster and with less complexity than legacy technologies. It connects disparate data and applications, including on-premises, from partners, and in the public cloud. An enterprise data fabric combines several data management technologies, including database management, data integration, data transformation, pipelining, API management, etc.

A data fabric addresses many of the limitations of data warehouses and data lakes and, according to Gartner, brings a new wave of redesign to the modern data architecture to create a more dynamic system. A smart data fabric extends the capabilities to include a wide array of analytics capabilities – eliminating the complexity and delays associated with traditional approaches like data lakes that require moving data to yet another environment.
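
As a rough illustration of connecting data in place rather than copying it into another environment, the sketch below puts two siloed sources behind a single access layer. Everything in it is hypothetical: the `DataFabric` class, the source names, and the sample records are invented for illustration and do not represent InterSystems' product or any specific data fabric implementation.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Source = Callable[[], Iterable[Record]]


class DataFabric:
    """Hypothetical access layer: sources stay where they are and are read on demand."""

    def __init__(self) -> None:
        self._sources: Dict[str, Source] = {}

    def connect(self, name: str, source: Source) -> None:
        # Register a callable that fetches records from an existing system.
        self._sources[name] = source

    def query(self, predicate: Callable[[Record], bool]) -> List[Record]:
        # Read every connected source at query time and tag each match with its origin.
        results: List[Record] = []
        for name, source in self._sources.items():
            for record in source():
                if predicate(record):
                    results.append({**record, "source": name})
        return results


# Invented sample silos standing in for an on-premises system and a cloud service.
def on_prem_trades() -> List[Record]:
    return [{"client": "A", "exposure": 120.0}, {"client": "B", "exposure": 40.0}]


def cloud_positions() -> List[Record]:
    return [{"client": "A", "exposure": 75.0}]


fabric = DataFabric()
fabric.connect("on_prem_trades", on_prem_trades)
fabric.connect("cloud_positions", cloud_positions)

# One query spans both silos; no data was copied into a separate store first.
client_a = fabric.query(lambda r: r["client"] == "A")
total_exposure = sum(r["exposure"] for r in client_a)
print(len(client_a), total_exposure)
```

A production data fabric would add the discovery, transformation, governance, and analytics capabilities described in the answer; the sketch only shows the single point of access over sources that remain in place.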

Q6. How can firms eliminate the friction that has been built up around accessing information, and reduce the cost and complexity of data wrangling?

Joe Lichtenberg: Building a smart enterprise data fabric as a data layer to serve the organization enables firms to reduce complexity, speed development, accelerate time to value, simplify maintenance and operations, and lower total cost of ownership. Additionally, it enables organizations to execute analytics and programmatic actions on demand, by utilizing clean and current data that resides within the organization.

Q7. What are your recommendations for firms that need to work with competitive insights?

Joe Lichtenberg: Competitive insights require access to accurate and current data that may reside in different silos in order to get maximum value. A data fabric provides the necessary access to this required data, but a smart data fabric takes this a step further. It incorporates a wide range of analytics capabilities, including data exploration, business intelligence, natural language processing, and machine learning, that enable organizations to visualize, drill into, explore, and combine the data from different sources. This helps not just skilled developers, data stewards and analysts, but also a wide range of users who are close to the business, to gain new insights to guide business decisions and create intelligent prescriptive services and applications.

Q8. Where do you see Artificial intelligence and machine learning technologies playing a role for financial institutions?

Joe Lichtenberg: Advanced analytics are essential to the future success of financial institutions. AI, ML, and natural language processing (NLP) tools are already being utilized in various areas of financial services. Although some may argue that niche FinTechs lead the way in the adoption of these tools, established organizations are also utilizing AI and machine learning to increase wallet share, enhance customer engagement, and guide strategic decisions. However, these tools are only as effective as the data that powers them. Without healthy data, they can’t deliver accurate results. That is why it’s essential to place an emphasis on the quality of data that is collected and fed into these powerful tools.

Q9. The next generation of technology advancement must be built on strong data foundations. Artificial intelligence and machine learning require a high volume of current, clean, normalised data from across the relevant silos of a business to function. How is it possible to deliver this data without requiring an entire structural rebuild of every enterprise data store?

Joe Lichtenberg: This is a common initiative for which a smart enterprise data fabric is being used. But implementing such a reference architecture can be complex, requiring the implementation and integration of many different data management technologies. Modern data platforms that combine multiple layers and capabilities in a single product, reducing complexity by minimizing the number of products and technologies required, are helping to deliver critical business benefits with a simpler architecture, faster time to value, and lower total cost of ownership. For example, modern data platforms combine horizontally scalable, transactional and analytic database management capabilities, data and application integration functionality, data persistence, API management, analytics, machine learning, and business intelligence in a single product built from the ground up on a common architecture. Not only can the implementation of a smart enterprise data fabric with a modern data platform at the core help firms address current pain points, it also accelerates the move toward a digital future without the costly rip-and-replace of their current operational infrastructure.

———————

Joe Lichtenberg is responsible for product and industry marketing for data platform software at InterSystems. Joe has decades of experience working with various data management, analytics, and cloud computing technology providers.

Resources

Five Key Reasons to Invest in a Smart Data Fabric. InterSystems.

Accelerate Your Enterprise Data Initiatives with a Smart Data Fabric. InterSystems. (PDF download, registration required)

Related Posts

On AI for Insurance and Risk Management. Interview with Sastry Durvasula. ODBMS Industry Watch, by Roberto V. Zicari on February 13, 2020

Follow us on Twitter: @odbmsorg