Navigating Data With LLMs - Arguments For And Against Sharing

One of the hottest issues in AI in the coming years will undoubtedly be how LLM developers use data to improve their models. In the lead-up to its IPO, Reddit was reported to have signed a deal with a large, unnamed AI company to license ten years' worth of data. This obviously raises eyebrows among users, but what does it mean for the broader ecosystem? How should your company think about licensing data to developers of LLMs, or opting out of sharing your data?

How is your data being used right now for LLMs?

Data crawling for LLM training involves automated programs (crawlers) that scan the internet to collect text, which is then used to train LLMs. Public web pages, unless specifically protected, can be crawled and their data used for training purposes. Organizations can use a robots.txt file to ask crawlers not to fetch their pages (nofollow attributes only affect how links are followed, not whether a page is crawled), but these rules must be implemented correctly, and they only work if the crawler chooses to respect them.
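To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser module. The crawler name "ExampleAIBot", the example.com URLs, and the rules themselves are hypothetical and chosen only for illustration; a crawler that ignores robots.txt would simply skip this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleAIBot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/articles/1"),
        ("SomeOtherBot", "https://example.com/articles/1"),
        ("SomeOtherBot", "https://example.com/private/report"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The check is entirely on the crawler's side, which is why the protection depends on the crawler cooperating.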

For closed models, like the ones developed by OpenAI, we do not know exactly what data was used to train them. In the wake of OpenAI's recent announcement of Sora, a text-to-video model, many want to know what data was used to produce such impressive results.

The main argument for sharing data – advancement

Sharing data can accelerate the development of more sophisticated and capable LLMs. There is a game-theory dynamic at play here: if everyone shares their data, the generalized models all get better, accelerating every industry in the process.

There are several arguments against sharing

The main argument, however, is that you could lose a massive competitive advantage if the generalized models end up being trained on your publicly available information. We are seeing this play out in various lawsuits, particularly in the media world.

Some organizations, like the WSJ, have explicitly added rules to their site's robots.txt to keep LLM crawlers from using their data to enhance generalized models. If you intend to opt out the same way, note that robots.txt uses per-crawler Disallow rules rather than no-follow tags, and the rules have to be written correctly for the opt-out to take effect.
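As a rough illustration of what a correct opt-out can look like, the sketch below generates a robots.txt that disallows a list of AI crawlers and then checks that the resulting file really blocks each of them. The tokens listed (GPTBot, CCBot, Google-Extended) are ones their operators have published, but the list is an assumption for this example and will not be complete or current; check each vendor's documentation before relying on it. The helper names and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed list of AI-crawler tokens; verify against each vendor's current docs.
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(blocked_agents):
    """Build robots.txt text that blocks the given agents and allows the rest."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    # Everyone not named above (e.g. ordinary search crawlers) stays allowed.
    groups.append("User-agent: *\nAllow: /\n")
    return "\n".join(groups)


def agents_not_blocked(robots_txt, agents, sample_url="https://example.com/"):
    """Return the agents that the given robots.txt still allows to fetch sample_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, sample_url)]


if __name__ == "__main__":
    robots_txt = build_robots_txt(AI_CRAWLERS)
    print(robots_txt)
    # An empty list means every listed crawler is disallowed.
    print("Still allowed:", agents_not_blocked(robots_txt, AI_CRAWLERS))
```

Running the same check against your live robots.txt is a cheap way to catch a misconfigured rule, such as an empty Disallow line, which permits everything.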

Business models for data

For organizations with large amounts of proprietary data, like Reddit, there are opportunities to monetize. Below are a couple of business models for your data, provided it is compelling enough.

Licensing Agreements: Instead of offering data freely, organizations can enter into licensing agreements with AI developers, specifying usage terms and receiving royalties.

Data-as-a-Service (DaaS): Offering datasets on a subscription basis to AI developers and researchers can provide a steady revenue stream while maintaining control over data usage.

Every organization should be thinking about how it wants its data used in the new world of AI. It is a very company-specific decision, and it requires a lot of deliberation before taking a strong position one way or another. One thing is clear, however: the pace of innovation is not slowing down. So if your company is built on data, as many are, now is the time for serious consideration, before we see the next generation of LLMs.
