19 February 2025
Artificial intelligence (AI) models today are hungry for data. Systems like chatbots and image generators learn from huge collections of text, images, and other information gathered from the internet. Often, this data is collected through data scraping – a process where software automatically extracts information from websites and databases.
While scraping provides the fuel for AI innovation, it also raises serious intellectual property (IP) questions. Content creators and companies worry that AI firms are using their work without permission. This tension has sparked debates (and even lawsuits) about how to balance the benefits of AI development with the rights of content owners.
Understanding Data Scraping in AI
Data scraping, sometimes called web scraping, means using a computer program (like a bot or crawler) to collect large amounts of data from online sources.
Unlike a person manually copying information, a scraping program can quickly gather thousands or millions of pieces of content. AI developers rely on scraping to build the vast datasets needed to train AI models. For example, an AI like ChatGPT “learns” from text scraped from countless websites, books, and articles. Scraping lets AI access far more information than a human could ever read, uncovering data hidden deep in websites or databases. This machine-driven mining of information is a big reason AI has advanced so rapidly – but it’s happening so fast that laws and norms are struggling to keep up.
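To make the contrast with manual copying concrete, here is a minimal sketch of what a scraper does, using only Python's standard-library HTML parser. The page content, the `HeadlineScraper` class, and the `scrape_headlines` helper are all hypothetical illustrations, not any real crawler's code; a real crawler would fetch pages over HTTP and follow links at scale.

```python
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    """Collects the text of every <h2> element it encounters."""
    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and data.strip():
            self.headlines.append(data.strip())

def scrape_headlines(html: str) -> list[str]:
    # A real bot would download thousands of pages and run this
    # extraction on each one; here we parse a hard-coded page to
    # keep the sketch self-contained.
    parser = HeadlineScraper()
    parser.feed(html)
    return parser.headlines

page = """
<html><body>
  <h2>Article one</h2><p>Body text...</p>
  <h2>Article two</h2><p>More text...</p>
</body></html>
"""
print(scrape_headlines(page))  # ['Article one', 'Article two']
```

The point of the sketch is scale: the same few lines that extract two headlines here will, when pointed at a crawl queue, extract millions, which is precisely what raises the IP questions discussed below.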
Why Intellectual Property Matters
Intellectual property refers to creations of the mind that the law protects, such as writings, art, inventions, brand names, and secret formulas. Two key types of IP relevant to AI data scraping are copyrights (which protect creative works like text, images, music, etc.) and trade secrets (confidential business information that has value).
Other related legal concepts include contractual rights (like website terms of use), unfair competition laws, and even trademarks or database-specific rights. When AI models scrape data, they might run into several of these protections. Here we break down the main legal concerns in simpler terms:
Copyright Concerns
When an AI scrapes content from the web, it might be copying someone’s copyrighted material (for instance, an article, a photo, or a piece of artwork). Copyright law gives creators exclusive rights over their work – including the right to control reproductions. If an AI company copies thousands of artworks or articles without permission, it could be seen as copyright infringement (violating the owner’s rights). One crucial point is that while a human browsing a website is implicitly allowed to view and temporarily store content (a sort of “implied license” to use the content for personal viewing), this does not extend to automated bots doing mass copying. In other words, just because a website is publicly viewable doesn’t mean an AI can legally scrape everything from it. This distinction has been the basis for copyright lawsuits against scrapers.
A big question is whether using copyrighted material for training an AI could be allowed under exceptions like fair use. Fair use (or similar doctrines like fair dealing in other countries) allows limited use of copyrighted content without permission in certain cases (like criticism, research, or parody). AI companies argue that teaching an AI from many works might be a transformative, fair use – the AI isn’t just republishing the works, it’s learning patterns. However, these defences are uncertain and vary by jurisdiction. In fact, in many places the law is not settled on this issue.
The first hints of how courts might handle this are now emerging. For example, in Thomson Reuters v. ROSS Intelligence, No. 1:20-CV-613-SB (D. Del. Feb. 11, 2025), a U.S. federal court addressed whether an AI startup infringed copyrights by scraping a legal database to train its model. The court ruled that the copying was infringement and not fair use. However, it did offer some insight: it suggested that if the AI’s use of the text had been transformative (i.e. the AI only learned from the text and didn’t simply republish it), the analysis might lean towards fair use. Still, this area of law is so new that each case could set an important precedent, and this decision is likely only the first round in the ongoing legal battles between rights owners and the generative AI industry. Content owners like news organizations, authors, and image libraries are closely watching these developments.
Trade Secret Issues
Not all valuable data on the internet is meant to be public. Trade secrets refer to confidential data or know-how that companies actively keep secret (like a secret recipe or an unreleased product database). In some scenarios, data scraping can edge into trade secret theft. How? Imagine a company has an online database that seems public because you can query pieces of it, but it’s designed so no one can copy the whole thing easily. A human user might retrieve one entry at a time but couldn’t practically collect the entire database. However, an AI “bot” can scrape piece by piece and assemble the whole database – potentially exposing information the company considered proprietary. U.S. courts have grappled with this situation.
In one notable case, Compulife Software, Inc. v. Newman, 959 F.3d 1288 (11th Cir. 2020), a competitor used a bot to scrape an online database of life insurance quotes that was technically publicly accessible but not in bulk. The 11th Circuit Court of Appeals concluded that even though a regular user had permission to access individual quotes, using a bot to harvest millions of quotes crossed a line. The automated scraping was deemed “improper means” because it collected data at a scale no human could, and thus could count as misappropriating trade secrets by copying the entire database. In simple terms, the court said that just because an AI can grab all that data doesn’t mean it may – and doing so can violate the rights of the data owner. This is just one court’s view, and it’s unclear whether others will agree, but it set an example of how scraping publicly available data in bulk might still be illegal if it undermines someone’s secret or proprietary compilation of data.
Contracts and Terms of Use
Aside from IP laws like copyright and trade secrets, data owners often use contracts to protect their information. The most common example is a website’s Terms of Service or Terms of Use. Have you ever seen a website’s terms that say something like “no automated scraping allowed” or “you must not copy content without permission”? When you visit or use that site, you may be agreeing to those terms (even if you didn’t actually read them). Website owners include anti-scraping clauses to make it clear that scraping is not authorised. These terms can be legally enforceable, essentially forming a contract between the site and the user/bot. If a scraper violates the terms, the website owner might sue for breach of contract. For instance, in a notable Irish High Court case in late 2023, the airline Ryanair succeeded in a claim against a company that scraped flight data from Ryanair’s site in breach of the site’s terms of use, which the court held formed an enforceable contract.
However, enforcing these terms is not always straightforward. First, one has to catch the scraper and prove they agreed (or were bound by) the terms. Second, even if you prove a breach, showing actual damage caused by scraping can be difficult. Despite these challenges, Terms of Use remain a key line of defence – they put scrapers on notice that their actions are unauthorised, strengthening the content owner’s position if a dispute arises.
Unfair Competition and Other Legal Angles
Different jurisdictions have other laws that can come into play. In some countries, taking someone else’s data en masse might be seen as unfair competition or a similar violation. For example, China’s Anti-Unfair Competition Law has been interpreted to cover data scraping scenarios. If a dataset was compiled through someone’s significant effort or investment, and a scraper simply takes that dataset to use for their own benefit, Chinese law might view it as an unfair exploitation of the original collector’s effort. The idea is that one business should not free-ride on another’s work in a way that harms the original business’s interests.
In the European Union, there’s a concept of database rights (a sui generis right for database creators) that might protect substantial collections of data, even if the individual pieces aren’t copyrighted. Meanwhile, trade mark issues could arise if, say, an AI uses a company’s logos or watermarks in a way that confuses people. (In one real 2023 US case, Getty Images found that some AI-generated images generated by Stability AI mimicked its watermark, leading to a claim of trademark infringement alongside copyright issues.) We also have privacy laws (like GDPR) if personal data is scraped, but that’s a whole other complex topic beyond IP. Overall, the legal landscape is patchwork: a scraper’s liability might depend on where they are, what kind of data they take, and how that data is protected by various laws.
Real-World Disputes and Reactions
These concerns aren’t just theoretical — several high-profile disputes have made headlines, highlighting how data scraping and IP rights are clashing in practice. Here are a few examples that show the scope of the issue:
- Twitter (X) vs. Meta (2023) – In July 2023, Twitter (recently rebranded as X) sent a cease-and-desist letter to Meta, accusing Meta of scraping Twitter’s data to help build its new Threads platform. The letter alleged that Meta hired former Twitter employees and unlawfully scraped data about Twitter’s users (followers) to gain an advantage. Twitter implied this might violate its rights (possibly its terms of service or trade secret laws). This was a warning shot rather than an immediate lawsuit, but it showed how seriously companies like X/Twitter guard their data. Meta denied the allegations, but the incident put the spotlight on how competitive the social media data race has become.
- Getty Images vs. Stability AI (2023) – Getty Images, a major stock photo company, filed a lawsuit against the AI firm Stability AI, the company behind Stable Diffusion, an image-generating AI model. Getty alleged that Stability AI scraped over 12 million of Getty’s copyrighted photos from the web without permission to train Stable Diffusion. Getty argued this massive copying infringed its copyrights: anyone wanting to use its library of images needs to obtain a license (and Getty in fact licenses data to some AI developers, which Stability did not do). The lawsuit even included a trademark twist: because some AI-generated images appeared with a distorted “Getty Images” watermark, Getty claimed consumers might be confused into thinking those AI images were affiliated with Getty. The case, filed in U.S. federal court, was one of the first major legal tests of scraping for AI image training. A parallel claim produced a January 2025 decision of the High Court of England and Wales with big implications for AI companies using internet images. In this chapter of the global dispute between the parties, a key focus was the role of the Sixth Claimant, a named individual, as a representative of a large number of copyright owners – an issue that goes to the heart of how the law grapples with mass claims in the “big data” era, where claims centre on the en-masse use of the data and digital works of vast swathes of people in the development of tech products such as large language AI models. The court ruled that it lacked jurisdiction to allow the claim to proceed as a representative action, and that even if jurisdiction existed, it would have refused permission as a matter of discretion. The primary issue was the class definition, which relied, in part, on whether a copyright owner’s works had been used to train Stable Diffusion.
However, there was no definitive list of copyrighted works included in the training data, and the defendant had only made limited admissions, acknowledging that “at least some” copyrighted works had been used. As a result, it was impossible to determine whether any specific individual qualified as a member of the represented class.
- Thomson Reuters vs. ROSS Intelligence (2020–2025) – Thomson Reuters, the company behind the Westlaw legal research database, sued a startup called ROSS Intelligence. ROSS had developed an AI legal research tool and allegedly used a data-scraping technique to copy content (legal case summaries called headnotes) from Westlaw’s database without permission. Westlaw’s content is copyrighted and behind a paid subscription, so Thomson Reuters claimed this was outright infringement. At summary judgment in 2023, the judge initially considered that the fair use question should go to a jury – meaning it wasn’t clear-cut – and analysed the fair use factors in depth along the way; in February 2025 he revisited that ruling and held that ROSS’s copying infringed and was not fair use. This case is significant because it is shaping how courts think about AI training on copyrighted text. It shows that even informational content (like legal summaries) can trigger copyright fights when used to train AI. The litigation was closely watched in tech and legal circles, and ROSS shut down under the pressure of the lawsuit. It underscored that AI companies can face serious legal challenges if they rely on proprietary data without authorisation.
These examples illustrate the broad range of data scraping conflicts – from social media data and trade secrets, to stock photos and copyrighted text. Each case helps define the blurry line between acceptable data use and IP violation.
Looking Ahead: Balancing Innovation and Rights
AI’s rapid growth has created an urgent need to balance innovation with respect for intellectual property. On one hand, data scraping is essential for AI development – without large-scale data, we wouldn’t have the powerful AI tools we do today. On the other hand, creators and companies deserve to have their rights and investments protected. Going forward, both legal and technological solutions are being explored to address this challenge.
Evolving Legal Frameworks: Lawmakers and courts around the world are starting to clarify how existing laws apply to AI, and in some cases, to create new rules. Notably, approaches differ by region. For example, the European Union has updated its laws to allow certain text and data mining exceptions: non-commercial research organizations (like universities or museums) are now allowed to scrape and mine copyrighted material for research purposes without explicit permission, as long as rights holders don’t object. However, for commercial AI developers, rights holders in the EU can opt out, meaning companies still need licenses to scrape data for profit. The United Kingdom considered a broad exception to allow data mining even for commercial AI, but it has paused that plan after pushback from content industries. Other countries are also acting: Singapore introduced an explicit exception for computational data analysis, which permits scraping but requires that the scraper have lawful access to the data. China, in turn, is emphasizing that AI training data should exclude content infringing IP rights. These efforts show a trend: governments are trying to encourage AI research and development while still giving creators tools to protect their works. We may see new laws or amendments in the coming years that spell out more clearly what AI developers can and cannot do with scraped data.
Industry Responses and Best Practices: Meanwhile, the tech industry isn’t waiting idly. Some AI companies are proactively striking licensing deals with content owners – essentially paying for the right to use certain datasets. This can be a win-win: the AI firm gets reliable, high-quality data, and the content owner gets compensation and control. We’ve seen this with companies negotiating access to news archives, artistic works, or stock image libraries rather than just scraping them. There’s also talk of industry standards or codes of conduct for AI data scraping. Such a code could, for example, set guidelines for respecting robots.txt files (which indicate if a site allows crawling), honouring opt-out requests from creators, or limiting the amount of content taken from a single source. Another idea is increasing transparency: AI developers could disclose the sources of their training data in a general way. In fact, the draft of the EU AI Act includes a requirement for AI model providers to publish summaries of their training data (like what types of data and from where). This kind of transparency could help identify if protected works were used and ensure compliance with opt-outs or licensing. It might also build trust with the public and content creators.
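One of the best practices mentioned above – respecting robots.txt files – is straightforward to implement, and Python even ships a parser for the format in its standard library. The sketch below is purely illustrative: the robots.txt text, the “AITrainingBot” user-agent name, and the `may_fetch` helper are all hypothetical, standing in for a site that permits ordinary crawling but opts out of AI training bots.

```python
import urllib.robotparser

# A hypothetical robots.txt: the site allows general crawling but
# asks a (made-up) AI training crawler, "AITrainingBot", to stay out.
ROBOTS_TXT = """\
User-agent: AITrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if this robots.txt permits the agent to fetch the URL."""
    parser = urllib.robotparser.RobotFileParser()
    # In practice the file would be downloaded from the site's
    # /robots.txt; here we parse the hard-coded text above.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

print(may_fetch("AITrainingBot", "https://example.com/articles/1"))  # False
print(may_fetch("NewsReader", "https://example.com/articles/1"))     # True
print(may_fetch("NewsReader", "https://example.com/private/x"))      # False
```

Note that robots.txt is only a convention, not a technical barrier or (in most jurisdictions) a clear legal one – which is why codes of conduct that commit scrapers to honouring it matter.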
Technological Measures: On the flip side, content owners are using tech to protect their data. Techniques collectively known as Technological Protection Measures (TPMs) or Digital Rights Management (DRM) can deter or detect scraping. For example, websites can deploy anti-bot systems that distinguish human visitors from scrapers (think of those “I'm not a robot” CAPTCHA tests, or more behind-the-scenes methods that block rapid-fire requests). Companies like Getty Images have used hidden markers (like watermarks or metadata) in their content so that if it shows up in an AI’s output or dataset, they have proof of scraping. However, scrapers are finding ways around some of these measures. AI training often involves “data cleaning” which might remove watermarks or metadata. It’s a bit of an arms race between scrapers and protectors – smarter bots versus smarter shields.
In conclusion, the intersection of AI data scraping and intellectual property is a developing story, one that requires careful navigation. The tension between protecting IP and fostering AI development highlights the need for clearer legal frameworks and proactive measures. No one wants to stifle innovation – AI has great benefits – but creators should feel secure that their rights won’t be ignored in the process. Finding the right balance is key. This might involve updated laws, industry self-regulation, new licensing models, and technology solutions working in concert.
Striking a balance between innovation and rights protection is crucial for the sustainable growth of AI technologies. As we move forward, expect to see more dialogue between tech companies, lawmakers, and content creators to ensure that AI can continue to learn and create in a way that respects the very human creations it’s learning from.
For more information and to read the full report on these issues recently released by the OECD see Intellectual property issues in artificial intelligence trained on scraped data.