From boardroom bedlam to courtroom drama, Sam Altman has had a tumultuous three months. In December, The New York Times filed a federal lawsuit against OpenAI, alleging that the company infringed on its copyrights by using its articles to train AI technologies like ChatGPT. This isn’t the first time OpenAI has been accused of copyright infringement – recall the suits brought by the Authors Guild, Sarah Silverman, and several other groups of authors – but this is by far the strongest case yet. It lays out striking similarities between Times articles and ChatGPT outputs, along with blatant examples of “regurgitation.”
News publishers have been getting the short end of the stick from Big Tech for a while. AI is well-positioned as the “future of search,” but because GenAI models are trained on publishers’ content and material across the web for free, there is a risk that users stop engaging with publishers’ sites at all. If users don’t click through to publishers’ sites and content, publishers face imminent extinction, threatening a free and fair press and democracies everywhere.
Google and Facebook have together turned publishers into hollowed-out husks, mined for content and harvested for money. OpenAI is on the cusp of serving up the KO. In this case, whether The New York Times or OpenAI prevails will be a saga worthy of popcorn and a hot mug of cocoa.
Let’s not forget that the spirit of copyright law is to encourage and protect innovation. There’s clearly a public interest in protecting copyright and supporting the New York Times, but there’s also a public interest in enabling free, unfettered innovation. So how can we strike the right balance?
I’ve argued for clear, middle-of-the-road, transparent rules that apply broadly to big tech companies. Serving as responsible stewards of data is not just a social responsibility for businesses but a competitive necessity in risky markets. Companies can show their commitment to ethical data by enacting principles such as privacy, agency, transparency, fairness, and accountability at every stage of the data lifecycle, from collection to use and retention.
OpenAI’s public explanation boils down to three counterarguments: first, that training is fair use (though they provide an opt-out); second, that “regurgitation” is a rare bug they’re driving to zero; and third, that The New York Times is not telling the full story. It’s possible the Times isn’t telling the full story, but OpenAI certainly isn’t either.
While the legal elements are up to the courts to decide, the bigger issue I see is that OpenAI clearly violates three key ethical data principles: agency, fairness, and transparency. In Silicon Valley, the AI race is leading companies, including OpenAI, to prioritize grabbing as much data as possible, often neglecting ethical considerations that impact everyone else.
The principle of fairness means that businesses must measure and mitigate the impact of their data systems, and of the outputs of machine learning, intelligent systems, and artificial intelligence, that may produce disparate impact or bias in application. But let’s be clear: if anyone tells you that OpenAI’s computer scientists actually understand what’s happening inside their models, run for the hills. LLMs are black boxes (even the humans who designed them cannot explain how a model uses data to produce a given result), and regurgitation isn’t a simple “bug” that can be fixed with the wave of a wand, as OpenAI implies. That said, the best measure of fairness is for platforms, and the businesses that use them, to ensure that the outputs of new technologies are fair and unbiased. The Rite Aid case is one recent cautionary example involving an emerging technology: the FTC banned the company from using facial recognition technology for five years over allegations of biased use in predominantly minority communities.
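To make “measure and mitigate” concrete, here is a minimal sketch – entirely my own illustration, not anything OpenAI or the FTC has published – of one common fairness check: the “four-fifths” disparate impact ratio, applied to a hypothetical model’s outcomes across groups.

```python
from collections import defaultdict

def disparate_impact_ratio(outcomes):
    """For each group, compute the ratio of its favorable-outcome rate to the
    best-performing group's rate. Ratios below ~0.8 (the "four-fifths rule")
    are a common red flag for disparate impact."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, got_favorable_outcome in outcomes:
        totals[group] += 1
        if got_favorable_outcome:
            favorable[group] += 1

    rates = {g: favorable[g] / totals[g] for g in totals}
    best = max(rates.values())
    return {g: rate / best for g, rate in rates.items()}

# Hypothetical audit data: (group label, did the system produce a favorable outcome?)
audit = [("A", True), ("A", True), ("A", False),
         ("B", True), ("B", False), ("B", False)]
print(disparate_impact_ratio(audit))  # {'A': 1.0, 'B': 0.5} -- group B falls below 0.8
```

A ratio well below 0.8 for any group is a conventional signal that a system deserves scrutiny before it is deployed, which is exactly the kind of measurement the fairness principle asks businesses to do.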
The principle of transparency requires that businesses communicate, in plain language, how they will use the data they collect, who they will share it with, and how long they plan to store it. Most blatantly, and despite its name, OpenAI is not candid about what data its models have been trained on. Nor does it tell users when generated outputs draw on copyrighted materials. In this way, OpenAI fits neatly within the tradition of Silicon Valley giants who justify dubious data standards to drive profit.
Lastly, the principle of agency is that people should be given choice and control over how their data is used, and the power to change their decision at any time. For companies, agency can be achieved through “opt-out,” transparency, and true orchestration (programmatically respecting people’s choices downstream). In OpenAI’s case, while they now provide an opt-out feature that lets users keep their ChatGPT inputs from being used for training, they have implemented it in a way that blatantly discourages users from exercising it. Additionally, ChatGPT has been trained on vast amounts of data, much of it obtained without consent, alongside some content that has been licensed. To justify the former, OpenAI argues that training its models is “fair use,” which permits limited use of copyrighted content without permission under specific circumstances. OpenAI isn’t denying that it used copyrighted material; it’s looking to justify that use as “fair use” and wrap it in a “progress trumps everything else” argument. However, training a model on copyrighted content with the intent to profit from the output is certainly not “fair use,” and I expect The New York Times to prevail on this point as the case unfolds.
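To illustrate what “true orchestration” could look like in practice, here is a minimal sketch – my own hypothetical, not OpenAI’s actual pipeline – in which an opt-out recorded at collection time is checked again at the point of use, so a choice made (or changed) upstream is respected downstream.

```python
from dataclasses import dataclass

@dataclass
class Record:
    user_id: str
    text: str
    opted_out_of_training: bool  # captured at collection time and kept with the record

def training_corpus(records):
    """Yield only the text of records whose owners have not opted out of training.
    Orchestration means the check runs at the point of use, so a choice made
    (or changed) upstream is respected automatically downstream."""
    for record in records:
        if not record.opted_out_of_training:
            yield record.text

# Hypothetical records; the names and fields are illustrative, not OpenAI's schema.
records = [
    Record("u1", "a conversation the user allows for training", False),
    Record("u2", "a conversation the user opted out of", True),
]
print(list(training_corpus(records)))  # only u1's text survives the filter
```

The design choice that matters is that consent travels with the record and is enforced where the data is consumed, not merely noted where it is collected.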
In the race for AI dominance, companies training large models are closely guarding their data sources as part of the “secret sauce” that sets their products apart. It does not have to be this way. Apple is offering an alternative by pursuing fair, transparent licensing agreements with publishers. Instead of quietly ripping off training data, companies can ethically and transparently source data in ways that respect creators while still building cutting-edge models.
Historically, copyright law has come into play whenever new technologies emerge: Napster was sued for enabling the exchange of copyrighted music owned by the major labels, and the Authors Guild sued Google, alleging that scanning library books and displaying free “snippets” online violated its members’ copyrights. The stakes for AI are similarly high.
The case is not only a wake-up call to enact ethical data principles but also an argument for innovative solutions that connect content creators and data licensors to the data marketplaces, application companies, and foundation model companies that need to license protected data.
After raising $13 billion from Microsoft, OpenAI can afford to settle this case and others like it. I’m optimistic that OpenAI and Sam Altman will conclude that they can’t just kick the can down the road – and that, if they try, they will pay for years to come.