The data hoarding of large language models (LLMs) is yielding to a new discipline: the management of data. Data management is the art of tracking and organizing data into a refined, indexed, filtered, and represented form. The data flywheel, as practitioners call it, allows data to be scaled up in different contexts to support collaborative systems where many models work in concert.
The business benefits will be insights, efficiencies, and new discoveries and innovations in even the most entrenched areas of business. Some vendors will approach this as agents, others as domain experts. As usual in our industry, a good approach to technology will take on many names as marketeers try to corner a technology domain and make it unique to themselves.
There is also talk about using more powerful AIs to filter data for use by less complex AIs (e.g., a lower-parameter model). This process of distillation shows a lot of promise and may become a strong factor in fine-tuning AI for targeted enterprise deployments. Open source will be disproportionately favored for distillation, as the big AI service providers enforce licensing terms that prohibit users from using their systems to train more specialized AIs.
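To make the filtering idea concrete, here is a minimal sketch in Python. It assumes the larger model is exposed as a scoring function that rates each candidate example. The names `filter_with_teacher` and `toy_teacher_score`, the threshold value, and the sample sentences are all hypothetical, and the toy scorer (longer text scores higher) is only a stand-in for a real model call.

```python
from typing import Callable, Iterable, List


def filter_with_teacher(
    examples: Iterable[str],
    teacher_score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only the examples the teacher rates at or above the threshold."""
    return [ex for ex in examples if teacher_score(ex) >= threshold]


def toy_teacher_score(text: str) -> float:
    """Hypothetical stand-in for a large model's quality judgment.

    A real teacher would be a model call; here the score simply grows
    with word count and is capped at 1.0.
    """
    return min(len(text.split()) / 10.0, 1.0)


raw = [
    "ok",
    "Quarterly revenue rose four percent on higher subscription renewals.",
    "click here",
]

# Only the examples the teacher considers useful are passed on
# as training data for the smaller model.
curated = filter_with_teacher(raw, toy_teacher_score, threshold=0.5)
```

The point of the sketch is the separation of roles: the expensive model only judges data, while the smaller model is the one that later trains on what survives the filter.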
What do these new data-centric workflows look like? Think about the processes by which drugs are discovered and tested or how corporate financial systems are put through their quarterly and annual reconciliation chores. Quality data is obtained and used in bias-defeating systems like double-blind trials for drug companies or audited records for financial analysis. Analysis follows well-defined workflows, judging both results and specific risks with what-if hypotheses, prior to actions like releasing a drug or making a strategic investment.
Those are the big outlines. GenAI may work within corporate workflows like team collaboration by mimicking several of these processes, including better data identifiers and refinement, to support the creation of more dependable results. The background processes, the math in creating data reliability, won’t change much from one team to another, but the metadata or labels will be specific to teams, tasks, and industries. This may not be obvious, but consider that how people collaborate can differ by industry and by organization.
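One way to picture that split between shared math and team-specific labels is the sketch below. It is an illustration under stated assumptions, not a prescribed method: the `Record` type, the reviewer-vote acceptance rate used as a reliability measure, and the example label keys are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    text: str
    # Reviewer judgments: True means the reviewer accepted the record.
    votes: List[bool]
    # Free-form metadata; the keys differ by team, task, and industry.
    labels: Dict[str, str] = field(default_factory=dict)


def reliability(record: Record) -> float:
    """Shared across teams: the fraction of reviewers who accepted the record."""
    if not record.votes:
        return 0.0
    return sum(record.votes) / len(record.votes)


# Two teams use the same reliability function with different labels.
pharma = Record(
    "Trial arm B met its primary endpoint.",
    [True, True, False],
    {"team": "clinical", "phase": "III"},
)
finance = Record(
    "Q3 ledger reconciled.",
    [True, True, True, True],
    {"team": "audit", "period": "Q3"},
)

scores = {r.labels["team"]: reliability(r) for r in (pharma, finance)}
```

The `reliability` function never inspects the labels, so it can be reused unchanged while each team defines its own metadata vocabulary.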
A new role, the data engineer, is already on the rise. That person or team will organize, prepare, and transform data to support this data-centric focus.