Data: The (UnSexy) Base Layer Upon Which All AI Systems Utterly Depend
AI systems have voracious appetites for data. Voracious may even be an understatement.
A massive abundance of data is needed to feed, and thus to train, all LLM systems (like ChatGPT and Gemini) so that they can learn. Data is also the root power behind all AI-enabled analytics efforts, such as we saw in the prior interview on this channel with Fred Brown, who has developed AI-driven algorithms that look at tens of thousands of variables in pharma and health care applications. (You can see that here.)
But, and it's a big “but,” the data you may find out there is of notoriously uneven quality. There's so-so data, good data, great data, fake data, and a lot of very bad data to contend with. So what's an organization that intends to leverage AI-driven analytic tools to do?
Cleaning this “base layer” data is a must, as is assuring its quality, consistency, and trustworthiness, not to mention - oh yeah - security.
To learn more about this critical topic, last week Langdon interviewed Srujan Akula, CEO of The Modern Data Company. Srujan describes these underlying challenges in depth, and explains how Modern's DataOS software operates to assure high quality, enabling analysts to spend their time actually doing analytics, instead of wasting tons of time merely hunting for and cleaning data to get it ready for analytics.
And this definitely matters. For example, a large firm that we have worked with extensively has experienced this problem acutely. Their analytics team consists of a few hundred very highly educated and highly paid staff. Perhaps surprisingly, a couple of years ago they observed that they were spending more than 50% of their time searching for data, and cleaning it up once they found it, so that they could then do what the company was actually paying them to do, namely the analytics. This was a source of extreme discomfort throughout the client’s organization, from IT all the way to the senior team, and so a great deal of money was spent over a span of years in a massive effort to alleviate the problem.
Modern would like to be the out-of-the-box solution to this problem, which is shared by nearly every major corporation worldwide, and many mid-sized ones as well. It wishes to do so through speed - one of the most compelling value propositions of DataOS is that built-in automation reduces the time to value "from months to minutes." Yes, Srujan really did mean to say that, and if Modern really can make good on that promise, then it’s in for a long run of outstanding success.
Some of the other highlights …
Srujan suggests that many (most?) organizations are struggling to get sufficient value from the massive investments they have made in data collection and storage systems, because even amid the abundance of data that they have created, there remain massive inconsistencies in critical qualities. Modern considers these dimensions to be critical to a full assessment of data quality:
Accuracy
Completeness
Freshness
Schema (that is, how the data is organized)
Uniqueness
And Validity
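To make these six dimensions a little more concrete, here is a minimal Python sketch of how each one might be scored on a small batch of records. This is only an illustration and not how DataOS works: the field names, the expected schema, the 30-day freshness window, the validity rules, and the trusted reference values are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical schema for the example: field name -> expected Python type.
EXPECTED_SCHEMA = {"id": int, "email": str, "amount": float, "updated_at": datetime}


def quality_report(records, reference, now, max_age=timedelta(days=30)):
    """Return, for each quality dimension, the fraction of records that pass."""
    checks = ("accuracy", "completeness", "freshness", "schema", "uniqueness", "validity")
    passed = dict.fromkeys(checks, 0)
    seen_ids = set()

    for rec in records:
        rec_id = rec.get("id")
        email = rec.get("email")
        amount = rec.get("amount")
        updated = rec.get("updated_at")

        # Accuracy: the amount agrees with a trusted reference value (assumed to exist).
        if rec_id in reference and amount == reference[rec_id]:
            passed["accuracy"] += 1

        # Completeness: every expected field is present and not None.
        if all(rec.get(field) is not None for field in EXPECTED_SCHEMA):
            passed["completeness"] += 1

        # Freshness: the record was updated within the allowed window.
        if isinstance(updated, datetime) and now - updated <= max_age:
            passed["freshness"] += 1

        # Schema: exactly the expected fields, each of the expected type (None allowed).
        if set(rec) == set(EXPECTED_SCHEMA) and all(
            value is None or isinstance(value, EXPECTED_SCHEMA[key])
            for key, value in rec.items()
        ):
            passed["schema"] += 1

        # Uniqueness: this id has not appeared earlier in the batch.
        if rec_id not in seen_ids:
            passed["uniqueness"] += 1
        seen_ids.add(rec_id)

        # Validity: values obey simple (assumed) business rules.
        if isinstance(email, str) and "@" in email and isinstance(amount, float) and amount >= 0:
            passed["validity"] += 1

    total = len(records)
    return {name: (count / total if total else 0.0) for name, count in passed.items()}


# Made-up sample data, purely for illustration.
now = datetime(2024, 1, 31)
records = [
    {"id": 1, "email": "a@example.com", "amount": 10.0, "updated_at": now - timedelta(days=1)},
    {"id": 2, "email": "bad-email", "amount": 5.0, "updated_at": now - timedelta(days=90)},
    {"id": 2, "email": "c@example.com", "amount": None, "updated_at": now - timedelta(days=2)},
    {"id": 3, "email": "d@example.com", "amount": 7.5, "updated_at": now - timedelta(days=3)},
]
reference = {1: 10.0, 2: 5.0, 3: 8.0}

report = quality_report(records, reference, now)
for dimension, score in report.items():
    print(f"{dimension}: {score:.0%}")
```

Each score is simply the share of records that pass the check, so a team could track the six numbers over time and decide which problem to tackle first. A real system would of course need far richer rules than these.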
Recognizing all these factors as areas of growing importance, a few years ago he and his co-founder, Animesh Kumar, set out to make a system that handles all these functions as automatically as possible. This enables analysts to do legitimate analytics instead of, as mentioned above, spending (wasting) tons of time, and therefore money, on data gathering, cleansing, and so on.
Data Products
A key idea that they’ve been focusing on over the last year is the notion of a “data product.” This is a data element that has been verified and validated and is ready to be plugged into whatever sort of analytics may be required. It is therefore a business asset.
The term “product” is intended to indicate its readiness, somewhat like ready-to-wear clothing (the French would say ‘prêt à porter’), or ready-to-eat (‘prêt à manger’) prepared food.
Here's a graphic from Modern that explains this concept a bit more:
The results of productizing have been encouraging – this is where tasks that previously may have taken months have recently been accomplished in minutes. Some of the customer stories and endorsements of Modern are genuinely impressive.
This increase in productivity implies that a significant revolution in data management is emerging, and of course as the co-founder of a Silicon Valley start-up, that’s exactly what Srujan has been targeting. With strong venture capital backing, the fortuitous timing of the emerging AI revolution, and an increasingly effective sales effort, the revolution has possibly begun.
Data and AI
So how exactly does all this play into the discussion about AI?
Two factors become apparent immediately. First, as mentioned above, Large Language Models, the famous LLMs like OpenAI’s ChatGPT and Google’s Gemini, are trained on data, so naturally the quality of the data (as well as its sheer quantity) will be critical to the quality of the resulting model. Train a model on bad data and you will sadly experience the age-old axiom of computer programming, “Garbage in, garbage out.” Given that training an LLM is becoming rather expensive, this is possibly going to be very pricey garbage that you get back. So clearly an investment in data purification prior to training will make sense for many organizations, and may even become mandatory and standard.
And of course we’re not really talking about generalized LLMs that are do-it-all knowledge engines, but rather the specialized models that organizations are now creating to handle organization-specific tasks. These models are often called “agents.”
The Agent Universe
Agents have varying degrees of capability, but an unvarying purpose: to get work done for organizations. They are highly specialized and task-specific. You may have seen that in a prior interview on our YouTube channel, Tom Brazil told us that his company (ICS) has already developed 80 such agents, and more are being created all the time. Using these agents, Tom reports efficiency gains of up to 20x, and he’s not kidding about that. You can watch that here.
So an entire “agentic universe” is being born, which may not be as sexy as the “Marvel Cinematic Universe,” but it is of great significance to all organizations that depend for their proper functioning on data. (And that is indeed all organizations.)
The efficacy of these agents relies heavily (emphatically) on data; hence the absolute significance of data quality, and hence the value of a tool such as DataOS.
A second value dimension lies in the application of AI to the very process of improving the quality of the data, and naturally Modern is doing this, too. We might call this front-end, or upstream AI, while training AI on the resulting data could be thought of as back-end, or downstream AI. There is a suggestive charm in using AI to improve the quality of the data that is used to create new AI capabilities. But it also implies something far more significant: this is a process that can accelerate itself.
And so the singularity creeps ever closer, as AI systems improve AI systems, a positive feedback learning loop with enormous possibilities. This is one reason that the AI revolution is accelerating; the implications are significant and far-reaching.
This may be the gritty and un-sexy infrastructure which supports the digital revolution, but it’s the grit as well as the infrastructure upon which the whole enterprise, and ultimately now the entire economy, depend. Hence, this interview has information that you may well need.
Please check out the video and hear it all in Srujan’s own words. Many thanks to Srujan for spending the time to educate us about this critical element of the AI plumbing!
What MIT Is Saying
Our YouTube interview video starts out with a quick quote from MIT Technology Review, which not long ago surveyed 350 Chief Data Officers, and then prepared a report entitled “Building a High-Performance Data and AI Organization.”
The very first paragraph of the resulting report says this:
Yes, this is so true: abundant, high-quality, and easily accessible data is now an imperative. It looks like Srujan and Modern may be sitting in a very sweet spot indeed.
As always, thanks for reading, and thanks for checking out all the videos on our YouTube Channel, AI Impact and Strategy.
Please sign up for our newsletter if you haven’t done so yet, and subscribe to the YouTube channel as well.
We always welcome your comments.
Please contact Langdon at LMorris@innovationlabs.com