On Using Unstructured Data to Produce Useful and Meaningful Insights. Q&A with Paul Speciale.

Q1. Large language models (LLMs) are the current shining stars and hold incredible potential and promise for business use, leveraging mainly structured (text-based) training data. Are these extravagant promises?

We believe there is exceptional potential for LLMs to deliver value, insights and monetization for businesses. However, we challenge the assertion that it is mainly structured data that will be leveraged: we see potential for incredible value to be extracted from unstructured data in the form of documents, images, event logs, IoT streams, backups, archives and many others. Combining structured and unstructured data in an intelligent data lake will make it possible to extract the full value of this data, especially by finding patterns and correlations across all of it.

Q2. Scality predicted that in 2024, “businesses will discover that the value of their vast troves of unstructured data, in the form of images and other media, will become a useful source of insights through AI/ML tooling for image recognition applications in healthcare, surveillance, transportation and other business domains”. Do you believe this is happening right now?

Yes, this has begun in earnest with significant investments in first projects within major corporations across these industries. Companies are both excited about the prospect of leveraging these technologies and afraid of how competitors might use them to gain an advantage if they don’t start to test and employ them in support of their own goals.

Q3. Can you give us some examples of how to use unstructured data to produce useful and meaningful insights?

There are many current projects but two specific examples I can provide are:

  • In transportation: images and video captured by autonomous vehicles are being processed through machine learning algorithms to create crash-avoidance algorithms.
  • In healthcare: machine learning is correlating patterns across millions of MRI and CT scan image files to provide early detection of cancerous tumors and other defects.

Q4. How do you manage data quality? If data is of poor quality, insights will also be of poor quality. Any thoughts on this?

The applications we see in large enterprises, such as in financial services (fraud detection, for example) and in healthcare (medical imaging), do not suffer from this problem: they ensure that image quality is high enough and that there are sufficient samples to deliver reliable results. In some surveillance applications, image quality issues are recognized, but we have certainly seen facial recognition algorithms working very effectively with the image quality of the commercial cameras being deployed today.

Q5. You believe businesses will store petabytes of unstructured data in scalable “lakehouses”? What exactly is a “lakehouse”?

Both traditional data warehouses and modern data lakes have their place and offer advantages for specific business purposes. As a combination of the two, the lakehouse in essence promises the benefits of both worlds, so it will be an interesting alternative for businesses dealing with very diverse data (including highly structured and unstructured data). 

Since object storage has become a de facto data repository for data lakes (both in the cloud, such as with AWS S3, and with on-premises object stores), it is clear that businesses will continue to store petabytes of data in them as part of new lakehouse initiatives.

Q6. What else is ahead for 2024/2025?

We expect corporations to take stock and measure the ROI of their first LLM projects, determine ways to optimize their results, and invest in even larger-scale projects in the coming one to two years. This has major implications for data: both the need to store massive volumes and the processing power needed to find insights in greater varieties and quantities of information.


Paul Speciale, CMO, Scality

Over 20 years of experience in Technology Marketing & Product Management. Key member of the team at four high-profile startup companies and two Fortune 500 companies.


Scality’s 2024 predictions: AI, hybrid cloud and ransomware detection will define the data storage landscape, while hard disk drives live on

Sponsored by Scality.
