On Analyzing Unstructured Data. — Interview with Michael Brands.
“The real difference will be made by those companies that will be able to fully exploit and integrate their structured and unstructured data into so called active analytics. With Active Analytics enterprises will be able to use both quantitative and qualitative data and drive action based on a plain understanding of 100% of their data”– Michael Brands.
It is reported that 80% of all data in an enterprise is unstructured information. How do we manage unstructured data? I have interviewed Michael Brands, an expert on analyzing unstructured data and currently a senior product manager for the i.Know technology at InterSystems.
Q1. It is commonly said that more than 80% of all data in an enterprise is unstructured information. Examples are telephone conversations, voicemails, emails, electronic documents, paper documents, images, web pages, video and hundreds of other formats. Why is unstructured data important for an enterprise?
Michael Brands: Well unstructured data is important for organizations in general in at least 3 ways.
First of all 90% of what people do in a business day is unstructured and the results of most of these activities can only be captured in unstructured data.
Second it is generally acknowledged in modern economy that knowledge is the biggest of asset of companies and most of this knowledge, since itʼs developed by people, is recorded in unstructured formats.
The last and maybe most unexpected argument to underpin the importance of unstructured data is the fact large research organizations such as Gardner and IDC state that: “80% of business is conducted on unstructured data”
If we take these tree elements together is even surprising to see most organizations invest heavily in business intelligence applications to improve their business but these applications only cover a very small portion of the data (20% in the most optimistic estimation) that are actually important for their business.
If we look at this from a different prospective we think enterprises that really want to be leading and make a difference will heavily invest in technologies that help them to understand and exploit their unstructured data because if we only look at the numbers (and thatʼs the small portion of data most enterprises already understand very well) the area of unstructured data will be the one where the difference will be made over the next couple of years.
However the real difference will be made by those companies that will be able to fully exploit and integrate their structured and unstructured data into so called active analytics. With Active Analytics enterprises will be able to use both quantitative and qualitative data and drive action based on a plain understanding of 100% of their data.
As InterSystems we have a unique technology offering that was especially designed to help our customers and partners in doing exactly that and weʼre proud our partners that actually deploy the technology to fully exploit a 100% of their data make a real difference in their market and grow way faster than their competitors.
Q2. What is the main difference between semi-structured and unstructured information?
Michael Brands: The very short and bold answer to this question would be to say semi-structured is just a euphemism for unstructured.
However a more in-depth answer is that unstructured data is a combination of structured and unstructured data in the same data channel.
Typically semi-structured data comes out of forms that foresee specific free text areaʼs to describe specific parts of the required information. This way a “structured” (meta)-data field describes with a fair degree of abstraction the contents of the associated text field.
A typical example will help to clarify this: In an electronic medical record system the notes section in which a doctor can record his observations about a specific patient in free text is typically semi-structured which means the doctor doesnʼt have to write all observations in one text but he can typically “categorize” his observations under different headers such as: “Patient History”, “Family History”, “Clinical Findings”, “Diagnose” and more.
Subdividing such text entry environments into a series of different fields with a fixed header is a very common example of semi-structured data. Another very popular example of semi-structured data is e-mail, mp3 or video-data. These data-types contain mainly unstructured data but these unstructured data is always attached to some more structured data such as: Author, Subject or Title, Summary etc.
Q3. The most common example of unstructured data is text. Several applications store portions of their data as unstructured text that is typically implemented as plain text, in rich text format (RTF), as XML, or as a BLOB (Binary Large Object). It is very hard to extract meaning from this content. How iKnow can help here?
Michael Brands: iKnow can help here in a very specific and unique way because it is able to structure these texts into chains of concepts and relations.
What this means is that iKnow will be able to tell you without prior knowledge what the most important concepts in these texts are and how they are related to each other.
This is why, when we talk about iKnow, we say the technology is proactive.
Any other technology that analyses text will need a domain specific model (statistical, ontological or syntactical) containing a lot of domain specific knowledge in order to make some sense out of the texts it is supposed to analyze. iKnow, thanks to its unique way of splitting sentences into concepts and relations doesnʼt need this.
It will fully automatically perform the analysis and highlighting tasks students usually perform as a first step in understanding and memorizing a course text book.
Q4. How do you exactly make conceptual meaning out of unstructured data? Which text analytical methods do you use for that?
Michael Brands: The process we use to extract meaning out of texts is unique because of the following: we do not split sentences into individual words and then try to recombine these words by means of a syntactic parser, an ontology (which essentially is a dictionary combined with a hierarchical model that describes a specific domain), or a statistical model. What iKnow does instead is we split sentences by identifying relational word(group)s in a sentence.
This approach is based on a couple of long known facts about language and communication.
First of all analytical semantics already discovered years ago every sentence is nothing else than a chain of conceptual word groups (often called Noun Phrases or Prepositional Phrases in formal linguistics) tied together by relations (often called Verb Phrases in formal linguistics). So a sentence will semantically always be built as a chain of a concept followed by a relation followed by another concept again followed by another relation and another concept etc.
This basic conception of a binary sentence structure consisting of Noun-headed phrases (concepts) and Verb-headed phrases (relations) is at the heart of almost all major approaches to automated syntactic sentence analysis. However this knowledge is only used by state-of-the-art analysis algorithms to construct second order syntactic dependency structure representations of a sentence rather than to effectively analyze the meaning of a sentence.
A second important discovery underpinning the iKnow approach is the fact, discovered by behavioral psychology and neuro-psychiatry, humans only understand and operate a very small set of different relations to express links between facts, events, or thoughts. Not only the set of different relations people use and understand is very limited but it is also a universal set. In other words people only use a limited number of different relations and these relations are the same for everybody no matter his language, education, cultural background or whatsoever.
This discovery can learn us a lot of how basic mechanisms for learning like derivation and inference work. But more important for our purposes is that we can derive from this that, in sharp contrast with the set of concepts that is infinite and has different subsets for each specific domain, the set of relations is limited and universal.
The combination of these two elements namely the basic binary concept-relation structure of language and the universality and limitedness of the set of relations led to the development of the iKnow approach after a thorough analysis of a lot of state-of-the-art techniques.
Our conclusion of this analysis is the main problem of all classical approaches to text analysis is they all focus essentially on the endless and domain specific set of concepts because they mostly were created to serve the specific needs of a specific domain.
Thanks to this domain specific focus the number of elements a system needs to know upfront can be controlled. Nevertheless a “serious” application quickly integrates several millions of different concepts. This need for large collections of predefined concepts to describe the application domain, commonly called dictionaries, taxonomies or ontologies, leads to a couple of serious problems.
First off all, the time needed to set up and tune such applications is substantial and expensive because domain experts are needed to come up with the most appropriate concepts. Second the foot print of these systems is rather big and their maintenance costly and time-consuming because specialists need to follow whatʼs going on in the domain and adapt the knowledge of the application.
Third, itʼs very difficult to open up a domain specific application for other domains because in these other domains concepts might have different meanings or even contradict each other which can create serious problems at the level of the parsing logic.
Therefore iKnow was built to perform a completely different kind of analysis because by focussing on the relations we can build systems with a very small footprint (an average language model only contains several 10.000s relations and a very small number of context based disambiguation rules).
Moreover our system is not domain specific but it can work with data from very different domains at the same time and doesnʼt need expert input. Splitting up sentences by means of relations and solving the ambiguous cases (this means the cases in which a word or word group can express both a concept and a relation e.g. walk: is a concept in this sentence: Brussels Paris would be quite a walk. and a relation in this sentence: Pete and Mary walk to school) by means of rules that use the function (concept or relation) of the surrounding words (or word groups) to decide whether the ambiguous word is a concept or a relation is a computationally very efficient and fast process and ensures a system that learns as it analyses more data because it kind of “learns” the concepts from the texts because it identifies them as “the groups of words between the relations, before the first relation and between the last relation and the end of the sentence.
Q5. How “precise” is the meaning you extract from unstructured data? Do you have a way to validate it?
Michael Brands: This is a very interesting question because it raises two very difficult topics in the area of semantic data analysis namely : How do you define precision and How to evaluate results generated by semantic technologies ?
If we use the classical definition of precision in this area, it describes what percentage of the documents given back by a system in response to a query asking for documents containing information about certain concepts actually contains useful information about these concepts.
Based on this definition of precision we can say iKnow scores very close to a 100% because it outperforms competing technologies in itʼs efficiency to detect what words in a sentence belong together and form meaningful groups and how the relate to each other.
Even if weʼd use other more challenging definitions of precision like: the syntactic or formal correctness of the word groups identified by iKnow we score very high percentages, but itʼs evident weʼre dependent of the quality of input. If the input doesnʼt accurately uses punctuation marks or contains a lot of non-letter characters that will affect our precision. Moreover how precision is perceived and defined varies a lot from one use case to another.
Evaluation is a very complex and subjective operation in this area because whatʼs considered to be good or bad heavily depends on what people want to do with the technology and what their background is. So far we let our customers and partners decide after an evaluation period whether the technology does what they expect from it and we didnʼt have “no goes” yet.
Q6. How do you process very large scale archives of data?
Michael Brands: The architecture of the system has been set up to be as flexible as possible and to make sure processes can be executed in parallel where possible and desirable. Moreover the system provides different modes to load data: A batch-load of data which has been especially designed to pump large amounts of existing data such as document archives into an system as fast as possible, a single source load thatʼs especially designed to add individual documents to a system at transactional speed, and a small-batch mode to add limited sets of documents to a system in one process.
On top of that the loading architecture foresees different steps in the loading process: data to be loaded needs to be listed or staged, the data can be converted (this means the data that has to be indexed can be adapted to get better indexing and analysis results), and, off course the data will be loaded into the system.
These different steps can partially be done in parallel and in multiple processes to ensure the best possible performance and flexibility.
Q.7 On one hand we have mining text data, and on the other hand we have database transactions on structured data: how do you relate them to each other?
Michael Brands: Well there are two different perspectives in this question:
On the one hand itʼs important to underline that all textual data indexed with iKnow can be used as if it was structured data, because the API foresees appropriate methods that allow you to query the textual data the same you would query traditional row-column data. These methods come in 3 different flavors: they can be called as native caché object script methods, they can be called from within a SQL-environment as stored procedures and they are also available as web services.
On the other hand thereʼs the fact all structured data that has a link with the indexed texts can be used as metadata within iKnow. Based on these structured metadata filters can be created and used within the iKnow API to make sure the API returns exactly the results you need.
Michael Brands previously founded i.Know NV a company specialized in analyzing unstructured data. In 2010 InterSystems acquired i.Know and since then he is serving as a senior product manager for the i.Know technology at InterSystems.
i.Know’s technology is embedded in the InterSystems technology platform.
- Managing Big Data. An interview with David Gorbet (July 2, 2012)