Why AI Data Transparency Matters to Content Owners
Companies developing generative AI models use massive amounts of training data. This often includes copyrighted content used without the consent or compensation of the rights holders. But little is known about exactly which works are being used, and that’s why many are now asking for more transparency.
Several jurisdictions, including the UK, the U.S., and the European Union, have introduced laws or proposals that would require AI companies to publicly disclose the data used to train their models. The goal is to help content owners understand if, when, and how their works have been used, and by whom.
Two laws are already in place: a section of the EU’s AI Act, which will go into effect in August, and California’s AB 2013. However, a federal bill in the U.S. known as the “Big Beautiful Bill” includes a 10-year moratorium on state-level AI regulation, which could prevent California’s rule from taking effect. Many believe a federal law on this issue will come soon.
In the UK, transparency has become a key issue in the Data (Use and Access) Bill, thanks to an amendment proposed by Baroness Beeban Kidron, a strong supporter of creators and the UK’s creative industries. The House of Lords has approved the amendment several times, but the House of Commons has removed it each time. The Lords passed it again just this week.
Kidron believes that transparency is essential to make copyright law enforceable, giving rights holders clear knowledge of when their works are used without permission. More than 400 artists, including Elton John, Paul McCartney, and Dua Lipa, have signed an open letter supporting the amendment.
The hope is that these rules will lead to a fair licensing system for training data, where creators are paid for the use of their work. Also, if AI companies know they must reveal what data they used, they may be less likely to train their models on illegally obtained content — like files downloaded through torrenting.
So far, though, data transparency hasn’t come up in licensing talks between AI companies and data marketplaces, according to two sources quoted by VIP+.
Whether these new rules will work depends mainly on two things:
1. How detailed the disclosures must be
If companies only provide vague summaries, rights holders won’t be able to take action. The EU AI Office is developing a template for a “sufficiently detailed summary,” but no one knows yet how specific it will be.
Last year, Open Future and the Mozilla Foundation published a policy brief suggesting how to define what “detailed” should mean.
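To make the stakes concrete, here is a hypothetical sketch of what one machine-readable entry in such a disclosure might contain. The EU template has not been published, so every field name below is an assumption about the level of detail rights holders are asking for, not an actual or proposed format.

```python
# Hypothetical sketch of one entry in a training-data disclosure.
# The EU AI Office template is not yet published; all field names
# here are assumptions, not an official schema.
disclosure_entry = {
    "dataset_name": "example-news-corpus",      # assumed identifier
    "source": "https://example.com",            # where the content was collected
    "content_type": "news articles",
    "collection_period": "2020-01 to 2024-06",
    "license_status": "licensed",               # e.g. "licensed", "TDM exception", "unknown"
    "opt_out_honored": True,                    # whether machine-readable opt-outs were respected
}

# A summary at this level of detail would let a rights holder check
# whether their site appears as a source; a vaguer one would not.
print(disclosure_entry["source"], disclosure_entry["license_status"])
```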
2. How copyright law treats AI training
In the U.S., many AI companies argue that training on “publicly available” content is allowed under fair use. That argument is being tested in several lawsuits. Some companies, including OpenAI and Google, have even asked the government to codify an explicit legal exception for AI training.
Meanwhile, some governments are considering changes to copyright law that would let AI developers train on copyrighted works without permission. One approach is a text and data mining (TDM) exception, which some countries already have: it allows AI systems to analyze content without a license, as long as the creator has not opted out.
In Europe, many experts read the EU AI Act’s requirement that companies make “best efforts” to respect opt-outs as a sign that the TDM exception also applies to generative AI. The UK was considering a similar exception, but the proposal was paused after strong public opposition.
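In practice, opt-outs are commonly expressed through machine-readable signals such as a site’s robots.txt file, which some AI crawlers (OpenAI’s GPTBot, for example) say they respect. Below is a minimal sketch of how a crawler’s “best efforts” check might work, using Python’s standard-library robots.txt parser; the URLs and the choice of robots.txt as the opt-out channel are illustrative assumptions.

```python
# Minimal sketch: check a robots.txt opt-out before collecting a page
# for training. robots.txt is one common opt-out channel (the TDM
# Reservation Protocol is another); the URLs here are illustrative.
from urllib.robotparser import RobotFileParser

def may_collect(page_url: str, robots_url: str, user_agent: str = "GPTBot") -> bool:
    """Return True if the site's robots.txt permits this crawler to fetch the page."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # download and parse the site's robots.txt
    return parser.can_fetch(user_agent, page_url)

if __name__ == "__main__":
    # Hypothetical publisher that may have opted out of AI training crawls.
    if not may_collect("https://example.com/article",
                       "https://example.com/robots.txt"):
        print("Opt-out detected: skip this page for training data.")
```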
Source: Variety VIP