One of the hottest issues in AI in the coming years will undoubtedly be how LLM developers use data to improve their models. In the lead-up to its IPO, Reddit reportedly signed a deal with a large, unnamed AI company to license ten years' worth of data. This obviously raises eyebrows among users, but what does it mean for the broader ecosystem? How should your company think about licensing data to developers of LLMs, or opting out of sharing your data?
How is your data being used right now for LLMs?
Data crawling for LLM training involves automated programs (crawlers) that scan the internet to collect text, which is then used to train LLMs. Public web pages, unless specifically protected, can be crawled and their data used for training purposes. Organizations can use a robots.txt file, or robots meta tags such as nofollow, to ask crawlers to stay away from their pages. These measures are voluntary, though: they must be implemented correctly, and the crawler has to choose to respect them.
For closed models, like the ones developed by OpenAI, we do not fully know what data was used to train them. In the wake of OpenAI's recent announcement of Sora, a text-to-video model, many want to know what was used to create such magnificent results.
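To make the robots.txt mechanism concrete, here is a minimal sketch of the check a well-behaved crawler might run before fetching a page. It uses Python's standard urllib.robotparser module. The robots.txt content, the crawler name ExampleAIBot, the example.com URLs, and the may_fetch helper are all made up for illustration; this is not how any particular AI company's crawler is known to work.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download this from
# the site's /robots.txt path before requesting any other page.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # ExampleAIBot (a made-up name) is disallowed everywhere on the site.
    print(may_fetch(ROBOTS_TXT, "ExampleAIBot", "https://example.com/articles/1"))
    # Any other crawler is only kept out of /private/.
    print(may_fetch(ROBOTS_TXT, "SomeOtherBot", "https://example.com/articles/1"))
    print(may_fetch(ROBOTS_TXT, "SomeOtherBot", "https://example.com/private/report"))
```

Nothing here is enforced by the website itself: a crawler that skips this check can still download the page, which is why robots.txt only works against crawlers that choose to respect it.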
The main argument for sharing data: advancement
Sharing data can accelerate the development of more sophisticated and capable LLMs. There is a game-theoretic dynamic at play here: if everyone shares their data, general-purpose models all get better, accelerating every industry in the process.
There are several arguments against sharing
The main one, however, is that you could lose a massive competitive advantage if general-purpose models end up being trained on your publicly available information. We are seeing this play out in various lawsuits, particularly in the media world.
Other organizations, like the WSJ, have explicitly added rules to their site’s robots.txt to keep LLM developers from using their data to enhance general-purpose models. If you intend to add rules like these, they have to be written correctly, with a Disallow rule under each crawler's user agent, in order for you to opt out.
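As a rough illustration of what a correct opt-out looks like, the sketch below generates a robots.txt that disallows a list of AI crawlers and then verifies the result with Python's standard urllib.robotparser. GPTBot (OpenAI) and CCBot (Common Crawl) are used as example user-agent tokens; the list is an assumption rather than a complete inventory, so check each vendor's documentation for current names. The helper functions and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example crawler tokens only; this list is illustrative, not exhaustive.
AI_CRAWLERS = ["GPTBot", "CCBot"]


def build_robots_txt(blocked_agents):
    """Build robots.txt text that blocks the listed agents and allows the rest."""
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    # Every other crawler (search engines, for example) keeps full access.
    blocks.append("User-agent: *\nAllow: /\n")
    return "\n".join(blocks)


def is_blocked(robots_txt, agent, url="https://example.com/"):
    """Return True if the rules tell the given agent not to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    robots_txt = build_robots_txt(AI_CRAWLERS)
    print(robots_txt)
    for agent in AI_CRAWLERS + ["Googlebot"]:
        print(agent, "blocked" if is_blocked(robots_txt, agent) else "allowed")
```

Checking the generated file with a parser before deploying it is a cheap way to catch a misspelled user agent or a misplaced rule, either of which would silently leave the opt-out ineffective.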
Business models for data
For organizations with large amounts of proprietary data, like Reddit, there are opportunities to monetize. Below are a couple of examples of business models for your data, provided it is compelling enough.
Licensing Agreements: Instead of offering data freely, organizations can enter into licensing agreements with AI developers, specifying usage terms and receiving royalties.
Data-as-a-Service (DaaS): Offering datasets on a subscription basis to AI developers and researchers can provide a steady revenue stream while maintaining control over data usage.
Every organization should be thinking about how it wants its data used in the new world of AI. It is a very company-specific decision, and it requires a lot of deliberation before taking a strong position one way or another. One thing is clear, however: the pace of innovation is not slowing down. So if your company is built on data, as many are, now is the time for serious consideration, before we start to see the next generation of LLMs.