Understanding the Text and Data Mining Exception to Copyright Laws: Implications for Training Large Language Models


In the rapidly evolving field of artificial intelligence, the training of Large Language Models (LLMs) has become a focal point for innovation. These models, which power applications like chatbots, translation services, and content generation, rely heavily on vast amounts of text and data. However, the legal landscape surrounding the use of copyrighted materials for text and data mining (TDM) is complex and varies by jurisdiction. This article outlines the TDM exceptions available in key regions and what they mean for training LLMs.


What is TDM?

TDM involves the automated processing of large datasets to discover patterns, trends, and other valuable information. For training LLMs, TDM is essential as it allows AI systems to learn from extensive collections of text, improving their ability to understand and generate human language.


TDM Exceptions to Copyright Laws

Different countries have established various exceptions to copyright infringement that permit TDM under certain conditions. Understanding these exceptions is crucial for AI developers, researchers, and businesses. Here’s a breakdown of TDM exceptions in key regions:

European Union

The European Union has taken significant steps to facilitate TDM while protecting copyright holders. The Directive on Copyright in the Digital Single Market includes two key exceptions:

  1. Article 3: Allows TDM for scientific research by research organizations and cultural heritage institutions, provided they have lawful access to the works.
  2. Article 4: Permits TDM for any purpose, as long as the user has lawful access to the content and the rights holder has not explicitly reserved their rights.
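
For content published online, Article 4 contemplates that a reservation of rights may be expressed by machine-readable means. The sketch below shows one way a crawler might check for such a signal before mining a page. It assumes, purely for illustration, that a robots.txt rule is treated as the opt-out signal; whether robots.txt actually amounts to a valid reservation is not settled by the text above. The bot name ExampleTdmBot, the example.com URL, and the helper crawler_is_disallowed are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def crawler_is_disallowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows user_agent from url.

    Assumption: a robots.txt disallow rule is used here as a stand-in for a
    machine-readable reservation of rights. This is an illustration only.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks one named TDM crawler and allows others.
EXAMPLE_ROBOTS = """\
User-agent: ExampleTdmBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(crawler_is_disallowed(EXAMPLE_ROBOTS, "ExampleTdmBot", page))
    print(crawler_is_disallowed(EXAMPLE_ROBOTS, "OtherBot", page))
```

A check like this only detects one possible opt-out signal. Reservations may also appear in terms of service or metadata, so it is a starting point rather than a compliance test.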

United Kingdom

The UK, following Brexit, maintains a similar stance with a specific TDM provision in the Copyright, Designs and Patents Act 1988. Section 29A allows TDM for non-commercial research if the researcher has lawful access to the material.


Japan

Japan’s Copyright Act provides a broad exception for TDM. Article 30-4 allows the reproduction of copyrighted works for data analysis (TDM) for any purpose, including commercial use, regardless of whether the rights holder has reserved their rights or whether the TDM user has lawful access.

United States

The U.S. relies on the fair use doctrine rather than a specific TDM exception. Key considerations include:

  1. Purpose and character of use: Non-commercial research and transformative uses are more likely to be considered fair use.
  2. Nature of the copyrighted work: Use of factual works is more likely to be fair use than highly creative works.
  3. Amount and substantiality: Using only the amount necessary for research supports a fair use claim.
  4. Effect on the market: If the use does not significantly affect the market value of the original work, it is more likely to be fair use.


Singapore

Singapore has provisions for TDM under its Copyright Act. Sections 243 and 244 permit TDM by commercial and non-commercial organisations, provided the user has lawful access to the material.
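
The conditions described for each region above can be compared side by side with a small lookup. The following is a rough sketch that encodes only the conditions as summarized in this article, not the full statutory tests. The TdmUse fields, the jurisdiction codes, and the function exception_may_apply are hypothetical names chosen for illustration. The United States returns None because fair use is a case-by-case balancing of factors rather than a checklist.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TdmUse:
    """A simplified description of a planned TDM activity (hypothetical model)."""
    commercial: bool        # the use is for a commercial purpose
    lawful_access: bool     # the user has lawful access to the works
    rights_reserved: bool   # the rights holder has reserved (opted out of) TDM
    research_org: bool = False  # scientific research by a research organization
                                # or cultural heritage institution


def exception_may_apply(jurisdiction: str, use: TdmUse) -> Optional[bool]:
    """Return True/False per the article's summary, or None if case by case."""
    if jurisdiction == "EU":
        article_3 = use.research_org and use.lawful_access
        article_4 = use.lawful_access and not use.rights_reserved
        return article_3 or article_4
    if jurisdiction == "UK":
        # Section 29A as summarized: non-commercial research with lawful access.
        # "Not commercial" is used here as a rough proxy for that condition.
        return use.lawful_access and not use.commercial
    if jurisdiction == "JP":
        # Article 30-4 as summarized: any purpose, no opt-out or access condition.
        return True
    if jurisdiction == "SG":
        # Sections 243 and 244 as summarized: lawful access is required.
        return use.lawful_access
    if jurisdiction == "US":
        return None  # fair use: weighed factor by factor, not a fixed rule
    raise ValueError(f"Unknown jurisdiction: {jurisdiction}")


if __name__ == "__main__":
    # A commercial developer with lawful access, where the rights holder opted out.
    startup = TdmUse(commercial=True, lawful_access=True, rights_reserved=True)
    for code in ("EU", "UK", "JP", "SG", "US"):
        print(code, exception_may_apply(code, startup))
```

For this example use, the sketch reports that the exception would not apply in the EU or UK, would apply in Japan and Singapore, and is undetermined in the United States. Real assessments turn on details this model leaves out.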


Relevance to Training LLMs

Training LLMs involves processing extensive text datasets, often including copyrighted material. TDM exceptions enable AI developers to legally utilize these datasets, fostering innovation while respecting copyright laws. Here’s how these exceptions impact LLM training:

  • Access to Data: TDM exceptions allow AI developers to lawfully use the vast amounts of data needed to train sophisticated models, usually on the condition that they already have lawful access to it.
  • Research and Development: Non-commercial research exceptions, such as those in the EU and the UK, support academic and non-profit research initiatives, driving advancements in AI technology.
  • Commercial Applications: Exceptions that extend to commercial use, such as those in the EU, Japan, and Singapore, enable businesses to develop and deploy AI applications without infringing copyright, promoting industry growth.


Join Us at TechLaw.Fest 2024 to Explore More

Understanding the legal nuances of TDM exceptions is vital for anyone involved in AI development. Our upcoming conference, TechLaw.Fest, will offer in-depth discussions on this topic, featuring insights from legal experts, AI researchers, and industry leaders. Don’t miss this opportunity to stay ahead in the world of generative AI.

Register now to secure your spot.