How AI and NLP can broaden data discovery and accessibility while maintaining governance.
By Ayush Parashar, Co-Founder and Vice President of Engineering, Unifi Software
The challenge of controlling and protecting data is a big one, but the bigger question is how to make people more productive with corporate information while maintaining standards of compliance and governance for broad access and use in the age of digital business. Applying AI and Natural Language Processing within the various stages of data analytics is a key way to democratize data and build in safeguards for broad use. Below is a Q&A with Ayush Parashar, a Co-Founder and Vice President of Engineering with Unifi Software.
Q1. Often the quest for security can eclipse data usability. How can AI be applied to both discover data and ensure information is not being seen or used by those who shouldn’t have access to certain kinds of data?
There are two important themes that have emerged within the enterprise: providing self-service access to data and governance of data. To get more value out of data there is a need for more self-service in every aspect of a business discipline.
At the same time, data privacy is a growing and very relevant concern. These two goals pull in opposite directions: providing more self-service can lead to less governance, and vice versa. This is where AI comes into play. It boils down to pattern recognition in multiple forms. Let me give you a couple of examples.
During data discovery you can mark certain data as sensitive based on a pattern that has been discovered before, so sets of data can be marked as sensitive through indirect inference. We use various machine learning methodologies to compute similarity scores between datasets and columns, and these scores are eventually used to classify data during discovery. The result is an informed administrator who has visibility into columns or data that are not marked sensitive but are very similar to sensitive columns. This all happens seamlessly for end users.
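To make the idea of a column similarity score concrete, here is a minimal sketch in Python. It is not Unifi's method; it assumes a deliberately simple technique in which each value is reduced to a character-class "shape" and two columns are compared by the cosine similarity of their shape counts. The function names, the sample columns and the 0.8 threshold are all hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def shape(value):
    """Collapse a value to its character-class pattern (letters -> A, digits -> 9)."""
    collapsed = re.sub(r"[A-Za-z]+", "A", value)
    return re.sub(r"\d+", "9", collapsed)


def profile(values):
    """Count how often each shape occurs in a column."""
    return Counter(shape(v) for v in values)


def cosine(a, b):
    """Cosine similarity between two shape-count profiles (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def flag_similar(sensitive, unlabeled, threshold=0.8):
    """Return (unlabeled column, sensitive column, score) for pairs at or above threshold."""
    hits = []
    for u_name, u_values in unlabeled.items():
        u_profile = profile(u_values)
        for s_name, s_values in sensitive.items():
            score = cosine(u_profile, profile(s_values))
            if score >= threshold:
                hits.append((u_name, s_name, round(score, 2)))
    return hits


# Hypothetical sample data: one column already marked sensitive, two unlabeled columns.
sensitive = {"customers.ssn": ["123-45-6789", "987-65-4321", "555-12-3456"]}
unlabeled = {
    "leads.tax_id": ["111-22-3333", "444-55-6666", "N/A"],
    "leads.city": ["Austin", "Boston", "Denver"],
}

print(flag_similar(sensitive, unlabeled))
```

In this toy data the unlabeled tax ID column is surfaced to the administrator because its values look like the known sensitive column, while the city column is not. A production system would likely combine many more signals, such as column names, value distributions and lineage.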
The second example is determining irregularities in user behavior. For this we use pattern recognition techniques to determine whether a user’s behavior is far from normal. For instance, the internal behavior and reporting algorithms can automatically alert an administrator that a user is logging in from an odd location or viewing data in an irregular manner.
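A rough illustration of this kind of behavioral check follows. It is only a sketch under simple assumptions: "normal" is modeled as the mean and standard deviation of a user's past daily activity, and a login location is "odd" if the user has never been seen there. The function names, the z-score cutoff of 3.0 and the sample numbers are hypothetical; real systems typically use richer models.

```python
from statistics import mean, stdev


def is_anomalous(history, today, z_cutoff=3.0):
    """Flag today's value if it lies more than z_cutoff standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to define "normal"
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_cutoff


def login_alerts(known_locations, location, history, rows_viewed_today):
    """Collect alert messages for an administrator about one user session."""
    alerts = []
    if location not in known_locations:
        alerts.append(f"login from unfamiliar location: {location}")
    if is_anomalous(history, rows_viewed_today):
        alerts.append(f"unusual data volume: {rows_viewed_today} rows viewed")
    return alerts


# Hypothetical history of rows viewed per day by one user.
rows_per_day = [120, 95, 130, 110, 105]

print(login_alerts({"Austin", "Denver"}, "Austin", rows_per_day, 125))
print(login_alerts({"Austin", "Denver"}, "Lagos", rows_per_day, 5000))
```

The first call produces no alerts because both the location and the volume are in line with history. The second produces two, one for each irregularity.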
Q2. How can AI and natural language search disclose risk around the use of data?
Natural language search is the most user-friendly interface for asking questions of data and metadata. It is very intuitive to ask questions like: ‘Who accessed the PII data in the last 3 days?’, ‘Who has access to these customer records?’, ‘What are the PII columns that don’t have masking defined for them?’ or ‘Which users have the highest anomaly in their login behavior in the past month?’. AI is used not only to find patterns associated with risk but also to make the natural language query (NLQ) interface stronger and deeper.
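As a toy illustration of how a question like the ones above might be mapped onto governance metadata, consider the sketch below. It assumes a hypothetical in-memory catalog and uses plain regular expressions to pick an intent; a real NLQ engine would rely on far more capable language models. All class, function and column names here are invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Column:
    table: str
    name: str
    is_pii: bool
    masked: bool


# Hypothetical metadata catalog.
CATALOG = [
    Column("customers", "email", is_pii=True, masked=True),
    Column("customers", "ssn", is_pii=True, masked=False),
    Column("orders", "total", is_pii=False, masked=False),
]


def unmasked_pii(catalog):
    """PII columns with no masking rule defined."""
    return [f"{c.table}.{c.name}" for c in catalog if c.is_pii and not c.masked]


def all_pii(catalog):
    """Every column marked as PII."""
    return [f"{c.table}.{c.name}" for c in catalog if c.is_pii]


# (pattern, handler) pairs; the first match wins, so specific patterns come first.
INTENTS = [
    (re.compile(r"pii columns.*(don'?t|do not|without).*mask", re.I), unmasked_pii),
    (re.compile(r"pii columns", re.I), all_pii),
]


def answer(question, catalog=CATALOG):
    """Route a natural language question to the first matching metadata query."""
    normalized = question.replace("\u2019", "'")  # treat curly apostrophes as straight
    for pattern, handler in INTENTS:
        if pattern.search(normalized):
            return handler(catalog)
    raise ValueError(f"no intent matched: {question!r}")


print(answer("What are the PII columns that don't have masking defined for them?"))
```

Each new kind of question is just another entry in the intent list, which keeps the interface easy to extend.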
Q3. Are we really at the point where we can explore ‘What if’ scenarios on our data? Is NLQ mature enough to do that?
The Natural Language Query (NLQ) interface has become very mature in the consumer space: consider Apple’s Siri, Amazon Alexa and Amazon Echo, or Google Home. What you will notice is that features are added as modules or packages.
For example, a weather or flights package can be added to the NLQ interface in consumer devices. Such modules and packages have been added over time in the consumer space, and today the experience is very rich. Similarly, while NLQ in the enterprise is behind the consumer industry, the foundation has been laid in various applications where incremental progress will quickly lead to maturity. As an example, Unifi started with an NLQ interface to search metadata and quickly grew to include statistics, governance and relationship modules, where you can ask more detailed questions about data and metadata.
Finally, in today’s world, engineering such interfaces is all about being highly extensible, so that you can quickly build NLQ interfaces for partners and other parts of the ecosystem. The idea is to create more open systems that can be extended through APIs and standard constructs for expressing NLQ.
The next evolution in this journey is maturing NLQ options for broad and deep understanding of enterprise needs.
Very quickly, such maturity can lead to enterprise-based NLQ being understood in multiple languages and interfaces, which adds another dimension to data and metadata democratization.
Q4. Rarely is all corporate data stored in one location. Most enterprise companies have some data in data lakes, data warehouses, silos or containers, or in hosted cloud environments. How can AI crawl all of these instances to get a complete view of a company’s data assets?
That’s very true. Data is scattered across on-premises, cloud and SaaS applications and the many other database platforms that a company invests in. Mostly, these systems are crawled through interfaces provided by each vendor.
For example, Amazon provides an interface to connect to and crawl its S3 storage. Oracle provides JDBC drivers to connect to and import metadata from an Oracle database. Where AI adds value is in discovering and comparing data that may sit in separate silos. What we have seen is a vast amount of replication of customer data with different governance paradigms in different places: a spreadsheet in Office 365, a file in S3 or Azure Data Lake Storage, a table in a customer data mart, an object in a CRM system or a bug in a JIRA system may all contain similar customer information. AI allows you to understand the similarities between these datasets even though they are in different formats and sit in different sources. Thus a data steward can figure out where sensitive information is currently unencrypted, unmasked or assigned the wrong user access.
At the same time, a data engineer can apply a formula from an Excel sheet to a column in a database table, all because AI understood the similarity between the columns and recommended using the formula or function on the table column.
Q5. We often hear that AI and ML get better through continuous learning. What will applying AI across the entire data pipeline do, in your opinion, to advance business success?
Absolutely, more usage of a product leads to better learning by AI and ML models, and that in turn leads to a better user experience and better results. The best example is NLQ: the more people use the feature, the better the recommendations for auto-complete and ‘people also ask for’ become. Similarly, recommendations in an application like Unifi get much stronger with usage of the product, all through the continuous learning built into various AI models.
Applying AI across the entire data pipeline will make it very easy for business users to work with data in areas like data discovery and data preparation, and can even lead to complete automation of data preparation jobs. AI will eventually lead to increased self-service and even improve governance rather than impeding it.