Tuesday, January 30, 2024

Is the Path to AI/ML Commercial Success Curated Walled-Garden Data Sources?

A lot has been written about hallucinations and the growing number of erroneous results coming from the major AI/LLM systems. Part of the problem is a feedback loop: as more bad data is generated, more bad data exists on the internet for the major AI systems to "learn" from. A good article on the subject is "Is ChatGPT Getting Worse Over Time?"

This phase of technology development is very different from the origins of the web and search engines, when the search companies dumped everything into their indexing engines and then presented many results for users to sort through. With LLMs, the user asks a specific question and expects a correct answer, so the relationship between questions and answers is now 1:1 rather than 1:n. I would argue that this puts a higher burden on the providers of answers.

So how can the benefits of AI/ML be safely realized by businesses and governments? I would posit that, at least until the LLMs that use public data can guarantee higher quality, the answer is to leverage well-curated data sources. A good example of this is Apple working on licensing data from well-known publishers; see "Apple Explores A.I. Deals With News Publishers".

So are the major beneficiaries of this wave of innovation the aggregators of clean, curated data? Perhaps the recent Juniper/HPE deal is an indication of this. One of the rationales for this deal was discussed on a couple of podcasts on Silicon Angle: "Research Analysis: HPE Acquires Juniper" and "The AI evolution in tech: Pioneering smarter decisions, from surgery to security". I would argue that if HPE can integrate its data from compute, storage and corporate wireless into MIST AI, the combined company will be able to offer customers something unique: the ability to manage their whole IT infrastructure through AI/ML that is safe and dependable.

This then raises a question for the customers of data aggregators: why should I allow you to collect and use my data? There has to be a strong value proposition for the customer to share their data, and a strong guarantee of anonymization. In the case of MIST AI, the benefit is improved IT management, with unified management of compute, storage and networking. I imagine HPE hopes that enterprises who buy Juniper will want to add HPE gear, and vice versa, to leverage the single view of the enterprise provided by MIST.

There are many other curated information sources (health care, security, manufacturing, etc.), but I believe that in all these verticals the key to successful deployment of AI/ML solutions is aggregating the data (and getting permission to do it, with the appropriate anonymization). The data also has to be aggregated in such a way that "truth" is maintained when the different sources are combined. The ability of the human brain to do such "voting" on multiple reference frames is discussed by Jeff Hawkins in "A Thousand Brains". Automating this knowledge collection and creating a clean knowledge base in multiple domains is, I think, one of the big challenges we face in making AI/ML successful.
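To make the "voting" idea a little more concrete, here is a minimal sketch in Python of how facts from several sources might be reconciled by simple majority vote. Everything in it is an assumption for illustration: the `vote_on_facts` function, the `min_agreement` threshold, and the source and fact names are all hypothetical, and a real knowledge base would need something far richer (source weighting, provenance, handling of stale data).

```python
from collections import Counter, defaultdict


def vote_on_facts(claims, min_agreement=0.5):
    """Keep a fact only when more than `min_agreement` of the sources
    that mention it agree on the same value.

    `claims` is an iterable of (source, key, value) tuples.
    Hypothetical helper for illustration, not any vendor's method.
    """
    votes_by_key = defaultdict(dict)
    for source, key, value in claims:
        # One vote per source per fact; a later claim from the
        # same source overwrites its earlier one.
        votes_by_key[key][source] = value

    accepted = {}
    for key, votes in votes_by_key.items():
        value, count = Counter(votes.values()).most_common(1)[0]
        if count / len(votes) > min_agreement:
            accepted[key] = value  # clear majority: treat as "truth"
        # Otherwise the fact is ambiguous and is left out.
    return accepted


# Made-up claims from three made-up sources.
claims = [
    ("vendor_doc", "switch_x.max_ports", 48),
    ("support_kb", "switch_x.max_ports", 48),
    ("forum_post", "switch_x.max_ports", 24),
    ("vendor_doc", "switch_x.poe", "yes"),
    ("forum_post", "switch_x.poe", "no"),
]

result = vote_on_facts(claims)
print(result)  # {'switch_x.max_ports': 48}
```

The port count is accepted because two of three sources agree, while the PoE fact is dropped because the sources split evenly. Dropping ambiguous facts rather than guessing fits the 1:1 expectation: it is better to have no answer than a wrong one.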

So not only is there a need to have a lot of data, there is also a need to have well-organized data sets containing unambiguous facts, collected from multiple sources, that can be leveraged to give accurate answers to questions that are now 1:1 rather than 1:n. Customers will need to see a benefit before they allow the aggregator to collect the data, and as the value of the data increases there could be some interesting discussions on licensing. When a customer goes to a corporate support site or management tool and asks a question, the expectation is now a single correct answer, not a list of search results that the user has to sift through to decide which is relevant to their problem.

