Wrestling Knowledge into Computable Intelligence
By Michael K. Bergman, Cognonto LLC
As we ask questions and get meaningful answers from our smart phones or see self-driving trucks safely navigate the roads of Colorado or witness facial recognition systems identify terrorists in airport videos, it is easy to forget that these capabilities did not exist five years ago. We are living in the midst of a revolution in artificial intelligence. Soon AI systems will be embedded in much of what we do for work and play.
But the idea of artificial intelligence (AI) is not new. The term itself was coined in the 1950s and there have been both winters and springs in the fortunes of AI. What has caused AI to blossom so greatly now?
Technology pundits tend to point to some common factors in the recent AI renaissance. Improved algorithms and faster graphics chips from gaming have been contributors.
In many pundits’ view, and certainly in mine, the most important factor in AI’s renaissance has been the availability of big data. Not just any big data will do, though; the data needs to be appropriate for the task and often labeled (annotated) to give the AI machine learners targets to optimize against. In the case of image recognition, this big data comes from massive datasets of images cross-tagged to the objects they represent. In the case of language translation, this big data comes from terabytes of search engine queries and results in multiple languages. Each objective need of artificial intelligence poses its own requirements in terms of the nature, scope and style of the big data to help train it.
A major opportunity nexus for AI is in understanding natural language and in discovering and reasoning over knowledge. This nexus has direct applications in question answering, intelligent search, data integration and interoperability, fact and entity extraction and tagging, categorization and classification, sentiment analysis, text generation and summarization, and relation extraction; in short, all of those areas associated with knowledge representation and analysis. The nature and structure of big data applicable to what is known as knowledge-based artificial intelligence, or KBAI, pose their own unique set of requirements.
Wikipedia and data from search engines are central to recent breakthroughs in KBAI. Wikipedia is at the heart of Siri, Cortana, the former Freebase, DBpedia, Google’s Knowledge Graph and IBM’s Watson, to name just a prominent few AI question answering systems. Natural language understanding is showing impressive gains across a range of applications.
To date, all of these examples have been the result of bespoke efforts. It is very expensive for standard enterprises to leverage these knowledge resources on their own.
As impressive as these efforts have been, we can still see at least three weaknesses (perhaps better called signs of adolescence) in KBAI in its present form. First, significant massaging and re-processing of the input knowledge bases (KBs) is necessary to stage them for artificial intelligence purposes.
Second, despite huge KBs such as Wikipedia (in its 200+ language versions) or Wikidata or others, no single KB alone is sufficient for specific domain purposes. We need to combine multiple KBs and then combine them with the domain knowledge of enterprises and other specific sources in order to do productive work. And, third, while useful to human readers and for human purposes, the structure of these KBs is not geared to knowledge representation and AI. To get them into that form requires significant time and effort, which is one reason why only the biggest tech companies have exploited them. Let me briefly address each of these points below.
First, the staging and processing of KBs are hindered by the knowledge bases themselves. Often the result of multiple authors or crowdsourcing, even the largest KBs (such as Wikipedia) have gaps, inconsistencies and incoherent aspects in their organizational (category) structures. They were not designed from the outset to be machine readable (there are exceptions).
It is natural there would be challenges in whipping them into shape for AI purposes. It is not surprising that only the very largest tech firms and R&D initiatives have been able to devote the time and effort to re-process these sources for use in artificial intelligence.
Second, combining KBs is an exemplar par excellence of the challenges that have faced data and information integration for decades. Terminology differs; schemas are inconsistent and non-overlapping; underlying premises and worldviews serve different audiences; the purpose of the knowledge base can favor different balances of text and structured data; update cycles and publishing formats are inconsistent. In all cases where text is employed, the issues of extracting an understanding of entities and concepts and then resolving ambiguities remain. The fact is, combining knowledge bases is hard, demanding work that takes much time and effort. Even with high degrees of automation, final mappings must still be vetted and verified manually.
Third, of course, existing knowledge bases were designed to serve humans, not machines. The conceptual scheme and architecture of a system designed for knowledge-based artificial intelligence should begin from an understanding of what applications or purposes are to be served. Many of these potential uses are articulated in the listing of KBAI opportunities noted above. These considerations help us define what the great 19th-century American logician, Charles S. Peirce, called the speculative grammar for our KBAI domain.
This grammar provides the nouns and verbs of KBAI, and should include such notions as concepts, entities, attributes of those entities, how we group entities into classes or types, relations between all of these various things, and how we name and refer to these things in a manner suitable to all human languages. Some have referred to a portion of this design as “things not strings”, and it is one basis by which a knowledge base like Wikipedia can be readily extended to so many human languages.
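To make the grammar a little more concrete, here is a minimal sketch in Python of how its parts might be represented. The class names, the ex: identifiers and the label_for helper are illustrative assumptions, not drawn from any particular knowledge base.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Entity:
    """A 'thing, not a string': one stable identifier, many names."""
    id: str
    types: Set[str] = field(default_factory=set)          # classes or types the entity belongs to
    labels: Dict[str, str] = field(default_factory=dict)  # language code -> preferred label
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass(frozen=True)
class Relation:
    """A typed link between two things, referenced by identifier."""
    subject: str
    predicate: str
    obj: str


def label_for(entity: Entity, lang: str, fallback: str = "en") -> str:
    """Return the label in the requested language, else the fallback, else the id."""
    return entity.labels.get(lang, entity.labels.get(fallback, entity.id))


# Hypothetical example data
germany = Entity(
    id="ex:Germany",
    types={"ex:Country"},
    labels={"en": "Germany", "de": "Deutschland", "fr": "Allemagne"},
    attributes={"isoCode": "DE"},
)
capital = Relation("ex:Berlin", "ex:capitalOf", "ex:Germany")

print(label_for(germany, "de"))  # Deutschland
print(label_for(germany, "ja"))  # Germany (no Japanese label, falls back to English)
```

Because relations point at identifiers rather than at names, adding a new language means adding labels, not restructuring the knowledge.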
The only tractable way to handle such integrations is by use of a common schema that itself can support logic and coherency tests, built upon this consistent grammar. Once done, such a computable foundation becomes a nucleus to test new mappings, aiding a snowball effect to grow the combined system.
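One simple kind of coherency test can be sketched as follows, assuming a toy schema with a type hierarchy and a few declared disjointness assertions. The type names and the PARENT and DISJOINT tables are hypothetical; a real schema would be far larger and would use a proper reasoner.

```python
# Hypothetical toy schema: child type -> parent type
PARENT = {
    "Logician": "Person",
    "Person": "Agent",
    "City": "Place",
}

# Hypothetical pairs of types declared to share no members
DISJOINT = {frozenset({"Agent", "Place"})}


def ancestors(type_name):
    """Return the type itself plus every ancestor in the hierarchy."""
    seen = set()
    current = type_name
    while current is not None and current not in seen:
        seen.add(current)
        current = PARENT.get(current)
    return seen


def is_coherent(existing_types, proposed_type):
    """Reject a mapping that would put an entity into two disjoint types."""
    proposed = ancestors(proposed_type)
    for existing in existing_types:
        for a in ancestors(existing):
            for p in proposed:
                if frozenset({a, p}) in DISJOINT:
                    return False
    return True


print(is_coherent({"Logician"}, "Person"))  # True
print(is_coherent({"Logician"}, "City"))    # False: an Agent cannot also be a Place
```

A candidate mapping from a new source that fails such a test can be flagged for manual review before it is allowed into the combined system.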
The image that comes to mind is of an M.C. Escher etching of a snake eating its tail, or a woodcut of geometric fishes morphing into birds. Patterns and regularity and logic must reside at the core of a computable KBAI system. But it also must be one that is recognized as always imperfect, needing testing and improvement, and inherently extensible in order to capture the constant growth and change characteristic of knowledge.
These are some of the challenges and mindsets for wrestling human knowledge into a sustainable basis for artificial intelligence useful to language understanding and knowledge applications.
Supervised machine learning requires labeled positive and negative training sets. Deep learning and unsupervised learning require properly bounded training corpuses of input documents (or big data). Whatever the form of learner, there is also the need for reference “gold standards” against which to test and refine the features and parameters used by the learners to obtain the most accurate desired results. To date, more than 90% of the effort in KBAI projects has been spent on staging the input data and creating these reference structures simply to begin the initial analysis. Because of the time and costs to get to initial liftoff, the crucial effort to test, compare and refine learners gets starved.
The promise of properly structured KBs to serve KBAI purposes is that we can reverse this imbalance.
We are now showing how it is possible to combine and represent multiple KBs in a design that enables us to flexibly create the training sets and corpuses and gold standards for virtually any slice and domain of knowledge desirable, all accomplished in a more-or-less automatic fashion. By freeing our data scientists of this tedium, we can now focus on creating the effective applications that knowledge managers seek.
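As a rough illustration of the idea, and not a description of any actual production pipeline, the sketch below draws a labeled training set for one type from a small set of typed entities. The entity names, the ENTITY_TYPES table and the training_set function are all hypothetical.

```python
import random

# Hypothetical slice of a typed knowledge base: entity id -> type
ENTITY_TYPES = {
    "ex:Berlin": "City",
    "ex:Denver": "City",
    "ex:Paris": "City",
    "ex:Germany": "Country",
    "ex:France": "Country",
    "ex:CharlesSPeirce": "Person",
    "ex:MCEscher": "Person",
}


def training_set(entity_types, target_type, n_negative, seed=0):
    """Build (entity, label) pairs: 1 for members of target_type, 0 otherwise."""
    positives = sorted(e for e, t in entity_types.items() if t == target_type)
    others = sorted(e for e, t in entity_types.items() if t != target_type)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    negatives = rng.sample(others, min(n_negative, len(others)))
    return [(e, 1) for e in positives] + [(e, 0) for e in negatives]


examples = training_set(ENTITY_TYPES, "City", n_negative=3)
for entity, label in examples:
    print(label, entity)
```

Changing the target type or the slice of entities yields a different training set with no manual labeling, which is the kind of flexibility a consistently typed structure makes possible.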
— Mike Bergman is the CEO of Cognonto, developer of the KBpedia KBAI knowledge structure, and a noted tech blogger.