
Navigating Copyright Issues in Data Mining

  • Writer: Aequitas Victoria

Paper Code: AIJACLAV02RP2025

Category: Research Paper

Date of Publication: May 19, 2025

Citation: Dr. Shiva Satish Sharda, “Navigating Copyright Issues in Data Mining”, 5, AIJACLA, 16, 16-22 (2025), <https://www.aequivic.in/post/navigating-copyright-issues-in-data-mining>

Author Details: Dr. Shiva Satish Sharda, Assistant Prof of Law, RGNUL



Abstract

The term "TDM" (text and data mining) was first used by Marti A. Hearst in 1999. TDM is the first step in building an AI model: gathering training data. It entails collecting large amounts of digital information in a methodical way and then using software to evaluate and extract useful information from that collection. Web scraping, crawling, and archiving are all part of this process. The EU Directive on Copyright describes TDM in terms of new technologies that enable "the automated computational analysis of information in digital form, such as text, sounds, images or data." According to the Directive, TDM makes it feasible to process massive amounts of data and thereby uncover new knowledge and emerging trends. Although TDM is also applied in various non-AI domains, this post focuses on TDM used for training AI models. Rapid technological advancements, such as artificial intelligence and generative AI, are posing a threat to human creativity. In December 2023, after discovering what it believed to be copyright infringement, the New York Times sued OpenAI and Microsoft in the US. Its main argument is that chatbots trained on millions of its articles are now competing with it. This lawsuit is one of many filed by authors and artists who allege copyright infringement by OpenAI. Japan, Singapore, and the European Union have made limited exceptions to their copyright rules to allow text and data mining, but the US position on using in-copyright works as training data is still evolving. Nearer to home, an important question arises: how can India's copyright law strike a balance between protecting copyright works and facilitating AI and machine learning? According to a recent press statement, the Government of India (GOI) is confident that existing copyright rules adequately handle concerns about AI-generated works and related advancements.
This analytical write-up focuses on the Indian legal framework applicable to text and data mining and its effectiveness in dealing with the legal challenges currently posed by generative AI.

 

Keywords: Artificial Intelligence, Intellectual Property, AI-generated Works, Legal Implications, IP Law Reform.


Copyright Issues Involving TDM

TDM transforms data into valuable knowledge by processing it to identify patterns and trends. The process can be likened to panning for gold, which demands substantial time and effort to locate even a small amount of metal. Substitute information for gold and algorithms for panning, and the concept of data mining is the same: just as gold mining extracts precious nuggets from immense quantities of rock, TDM endeavours to extract valuable information from large datasets. Nevertheless, TDM may raise copyright issues if the dataset contains original copyrighted works, such as text, photographs, or videos. In addition, database producers are protected under copyright law where the data has been selected or arranged creatively enough to make the database an original work.

The data mining process entails accessing, collecting, storing (copying), and transforming original works. From a TDM perspective, the input phase of generative AI can be divided into three stages: data access (step 1), data extraction and reproduction (step 2), and data mining and knowledge discovery (step 3). Legal issues are most likely to arise during the second stage.[1]

The initial stage, "data access," consists of gaining access to the data, which may be in digital form or in analogue form contained within physical objects. This stage precedes the replication of the data for TDM purposes; it is a pre-processing procedure necessary for the actual data collection. At this step, the data provider may exercise access control to ensure that rights holders receive pecuniary compensation for the use of their works. Where the data provider is also the copyright holder, the copyright holder is typically the access controller. Nevertheless, a database producer who organizes works into a database and makes them accessible to users may also regulate access to the data, subject to the original copyright holder's rights. The two primary methods of access control are contractual control and technological protection measures (TPM), including registration and authentication systems. During the access step, the data provider can limit not only access but also the availability and scope of TDM activities.[2]

 

Subsequently, the 'extraction' phase entails collecting data, including works, for TDM and preparing it for analysis. This phase may involve replicating substantial amounts of data using a variety of techniques. Examples include the initial replication of the data during collection, its reproduction when that first copy is transformed into input data for TDM processing, and additional reproduction for backup or verification purposes. The most direct conflicts with the exclusive rights of intellectual property holders frequently occur during this extraction phase, because each of these acts of reproduction may fall within those rights.[3]

The initial two stages involve the preparation of the input data for TDM purposes. The core of TDM technology is "mining," which involves the extensive analysis of the data to extract meaningful information. AI algorithms analyze the data at this stage, which leads to temporary reproduction. The data may be replicated across multiple servers to facilitate distributed processing during the analysis of large quantities of data. [4]
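The three stages described above can be sketched in code to show where reproductions arise. This is only an illustration under stated assumptions, not a description of any real system: the in-memory `SOURCE` collection, the function names, and the `copy_log` are all hypothetical, and whether any logged copy is legally relevant is the question the text discusses, not something the code decides.

```python
from collections import Counter

# Hypothetical stand-in for works held by a data provider.
SOURCE = {
    "work-1": "Data mining turns raw text into patterns.",
    "work-2": "Patterns in text reveal trends in data.",
}


def access(work_id, authorised_ids):
    """Stage 1, data access: only checks permission, no copy is logged."""
    if work_id not in authorised_ids:
        raise PermissionError(f"no access to {work_id}")
    return SOURCE[work_id]


def extract(work_id, text, copy_log):
    """Stage 2, extraction and reproduction: two copies are logged."""
    copy_log.append((work_id, "stage 2: collection copy"))
    # Normalising the text into tokens produces the input-data copy.
    tokens = text.lower().replace(".", "").split()
    copy_log.append((work_id, "stage 2: input-data copy"))
    return tokens


def mine(corpus, copy_log):
    """Stage 3, mining: analysis over a transient copy of each work."""
    counts = Counter()
    for work_id, tokens in corpus.items():
        copy_log.append((work_id, "stage 3: transient copy"))
        counts.update(tokens)
    return counts


def run_pipeline(authorised_ids):
    """Run all three stages and return the word counts and the copy log."""
    copy_log = []
    corpus = {}
    for work_id in sorted(authorised_ids):
        text = access(work_id, authorised_ids)
        corpus[work_id] = extract(work_id, text, copy_log)
    return mine(corpus, copy_log), copy_log


if __name__ == "__main__":
    counts, log = run_pipeline({"work-1", "work-2"})
    print(counts.most_common(3))
    for entry in log:
        print(entry)
```

In this sketch most of the logged reproductions occur in stage 2, which mirrors the observation that extraction is where conflicts with exclusive rights are most likely.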

Today, the reach of data mining (DM) technology extends as far as the digital infrastructure on which data can be located, and its effects on our daily lives are ubiquitous. Electronic data storage has simplified data retrieval and archival, opening a new door to DM. The open-source nature of web robots has also lowered the obstacles to adopting DM technology. DM rests on this readily accessible technological infrastructure and is driven primarily by two groups of stakeholders: data proprietors ("owners") and data miners ("miners"). DM begins when owners make their data available on the Internet; only then can miners acquire the data for analysis. DM may then be carried out for commercial purposes (e.g., companies using DM to tailor promotions to customer profiles for targeted advertising) or for non-commercial purposes (e.g., academic research). [5]

Certain proprietors who wish to commercially exploit their data (which may be contained in works protected by copyright) may implement technological measures to prevent unauthorized access to their data. 
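One widely used way for an owner to signal restrictions to web robots is a robots.txt file. It is a voluntary convention rather than a technological protection measure in the strict sense, and it is not discussed in the sources cited here, so the sketch below is offered only as an assumption-laden illustration of how a miner might check such a signal before collecting data. The bot name `ExampleMinerBot`, the `example.org` URLs, and the rules themselves are hypothetical; the parser is Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a data owner: one named mining
# bot is barred from the articles, other robots are allowed everywhere.
ROBOTS_TXT = """\
User-agent: ExampleMinerBot
Disallow: /articles/

User-agent: *
Allow: /
"""


def may_mine(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is made in this sketch.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleMinerBot", "OtherBot"):
        url = "https://example.org/articles/1"
        print(agent, url, may_mine(agent, url))
```

A check of this kind only reports what the owner has requested; it neither enforces access control nor answers whether mining the page would be lawful.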


The application and benefits of TDM in India are being increasingly recognized and utilized in a variety of sectors. Justice K.S. Puttaswamy (Retired) vs Union of India and Ors.[6] was a landmark case in which the Supreme Court of India acknowledged the significance of TDM in relation to privacy concerns. The Court determined that the state has a legitimate basis for requiring the acquisition of authentic data, as it is engaged in data mining to ensure that resources are appropriately allocated to legitimate beneficiaries. Nevertheless, India has not yet experienced any specific litigation or established precedents in relation to TDM in the context of copyright. The absence of specific text and data mining exceptions in India raises concerns regarding the justification of actions within the fair dealing framework, as outlined in Section 52 of the Copyright Act, 1957.[7]

The Delhi High Court correctly observed in The Chancellor, Masters & Scholars of the University of Oxford & Ors. v. Rameshwari Photocopy Services & Ors.[8] (also known as the Delhi University photocopying case) that the copyright law is intended to harmonize the rights of copyright holders and consumers. Consequently, it was determined that "the rights of individuals mentioned in Section 52 are to be interpreted in accordance with the same principles as the rights of a copyright owner and are not to be interpreted narrowly or strictly in order to avoid reducing the scope of Section 51, as is the rule of interpretation of statutes in relation to provisos or exceptions."[9]


In India, the courts have considered factors such as the "purpose and character of the use" and the "amount and substantiality of the portion used" in applying fair dealing under Section 52 of the Act. In general, the fundamental question that courts evaluate is whether the purpose of the subsequent work and the prior work is substantially the same or substantially different. "If it is different, there will be no fair dealing." Lord Denning stated, "It is impossible to define what is 'fair dealing.'" It must be a matter of degree. In the absence of a statutory definition, the courts determine the meaning of "fair dealing" on a case-by-case basis. Although Section 52 enumerates specific circumstances, the Indian courts have interpreted it in a manner that has made the doctrine applicable in other cases.

Interpretation of Section 52 of the Act

The legislative intention behind Section 52 of the Act, as indicated by a variety of cases, is to safeguard the rights of the general public in India. As the Court declared in the Wiley Eastern Ltd. case, Section 52 is intended to safeguard the rights outlined in Article 19(1) of the Indian Constitution. The objective of the Section is to encourage research, private study, review, and criticism. The 2012 amendment to the Copyright Act, which broadened Section 52, demonstrates that its scope is wide and can be expanded in accordance with societal requirements. Indian courts have in numerous instances taken the four-factor test into account, with reference to US cases, and have resolved fair dealing cases according to the nature of the issue. In contrast to the position in the United States, Section 52 of the Act specifies the circumstances in which fair dealing will be applicable. However, it has been interpreted very liberally by the courts. Since the fair dealing principle is not rigidly defined, it may be applicable to TDM, and this liberal approach allows TDM to be governed under Section 52 of the Act. Nevertheless, TDM cannot be regulated exclusively by the fair dealing principle, as copyright is only one component of the issue. Fair dealing is multidimensional, not unidimensional, and should be viewed from the perspective of all three stakeholders: the authors, the proprietor, and the public.

India has a significantly higher incidence of copyright infringement than other countries. India is ranked 43rd among 53 countries in terms of the enforcement of intellectual property rights, which encompasses infringement, the civil and criminal legal procedures available to copyright holders, and the authority of customs officials to conduct border controls and inspections.[10]

This suggests the presence of copyrighted material that has been infringed upon. Furthermore, in certain instances it has been held that a user's use of an infringing copy is not in bad faith if the user acts reasonably and believes the use to be a "fair use." Fair use would be established where the infringing copy is employed for a transformative purpose. This may present difficulties, as developers may attempt to exploit it to bypass the need for the copyright owner's authorization, which could encourage the unauthorized use and sale of copyrighted subject matter in the country.

The critical questions that must be addressed are as follows: What is the status of the copyright holder's exclusive economic rights where a work is used for non-commercial purposes? What is the position on the transient storage of data? Which uses would be considered fair dealing under Section 52 of the Act, and which would not?

If a reproduction is made in violation of the Act's provisions, it is an infringing copy under Section 2(m) of the Act. This applies to dramatic, literary, musical, and artistic works. Under Section 52, storing a work in any electronic medium, including the incidental storage of a computer program that is not itself an infringing copy, does not constitute infringement if the copyrighted material is used for personal use, including research.[11] Furthermore, clauses (b) and (c) of Section 52(1) provide that the incidental or transient storage of a work or performance in the technical process of electronic transmission or communication to the public is not an infringement. Therefore, temporarily storing data in order to carry out an act authorized by Section 52 does not constitute infringement. As a result, non-commercial use of a copyrighted work that falls within Section 52 may not constitute infringement. The answer to the final question would vary case by case.

In Authors Guild v. Google, the Court decided that Google Books was intended to provide users with substantial information about the books, based on the "need" for the infringement and the "literal necessary" to accomplish the result. In A.V. v. iParadigms,[12] the Court relied on the purpose for which versions of the copyrighted material were retained. Such a purpose is not infringing if it is necessary to carry out the operation and is not mala fide. The courts may adopt comparable approaches in the event of a dispute regarding TDM. Protection should not extend to the idea behind a protected work, as it is a settled principle in India that copyright subsists in the expression of an idea and not in the idea itself. TDM activity is a non-expressive use of the copyrighted work. The High Court in the case of Kartar Singh Giani determined that the term "fair" in the context of fair dealing is connected to two factors. The first is "an intention to compete and to derive profit from such competition," while the second is the "motive of the infringer." In the TDM context, neither of these two elements is present. Therefore, in comparable scenarios, the use of TDM without consent may not result in copyright infringement. The courts may require years to resolve these issues, while the AI industry is at a stage that requires their resolution now if TDM is to be feasible without complications. Therefore, the legislature or executive must intervene in a variety of sectors to address all of these factors.

 


[1] Kyungsuk Kim, ‘Korean Copyright Issues in Text Data Mining for Generative AI’ <https://aire.lexxion.eu/data/article/19398/pdf/aire_2024_01-009.pdf> accessed 18 February 2025.

[2] Yuji Roh, Geon Heo and Steven Euijong Whang, ‘A Survey on Data Collection for Machine Learning: A Big Data-AI Integration Perspective’ <http://arxiv.org/pdf/1811.03402> accessed 31 January 2025.

[3] Yuji Roh, Geon Heo and Steven Euijong Whang, ‘A Survey on Data Collection for Machine Learning: A Big Data-AI Integration Perspective’.

[4] ibid; ‘AIRe - Journal of AI Law and Regulation: Korean Copyright Issues in Text Data Mining for Generative AI’ <https://aire.lexxion.eu/article/AIRE/2024/1/8> accessed 18 February 2025.

[5] ‘AIRe - Journal of AI Law and Regulation: Korean Copyright Issues in Text Data Mining for Generative AI’ (n 4).

[6] ‘JUSTICE K.S. PUTTASWAMY VS. UNION OF INDIA - South Asian Translaw Database - PRIVACY’ <https://translaw.clpr.org.in/case-law/justice-k-s-puttaswamy-anr-vs-union-of-india-ors-privacy/> accessed 19 February 2025.

[7] David Tan and Thomas Chee Seng Lee, ‘Copying Right in Copyright Law: Fair Use, Computational Data Analysis and the Personal Data Protection Act’ (2021) 33 Singapore Academy of Law Journal <https://heinonline-org.rgnul.remotexs.in/HOL/Page?handle=hein.journals/saclj33&id=606&div=29&collection=journals> accessed 18 February 2025.

[8] ‘University of Oxford v. Rameshwari Photocopy Service’ <https://www.theipmatters.com/post/university-of-oxford-v-rameshwari-photocopy-service> accessed 19 February 2025.

[9] Jef Ausloos, Rene Mahieu and Michael Veale, ‘Getting Data Subject Rights Right’ (2019) 10 Journal of Intellectual Property, Information Technology and Electronic Commerce Law <https://heinonline-org.rgnul.remotexs.in/HOL/Page?handle=hein.journals/jipitec10&id=292&div=30&collection=journals> accessed 18 February 2025.

[10] ‘TDM and Copyright: Balancing Innovation and Legal Compliance - Lexology’ <https://www.lexology.com/library/detail.aspx?g=629e1c83-aa75-4d8c-8367-34c8bf5f233b> accessed 19 February 2025.

[11] Kailash Chauhan, ‘Generative AI, Text & Data Mining and the Fair Dealing Doctrine: Examining the New Problem with the Old Regime’ (2025) 30 Journal of Intellectual Property Rights (JIPR) 77 <https://or.niscpr.res.in/index.php/JIPR/article/view/12652> accessed 19 February 2025.

[12] ‘A.V. v. iParadigms’ accessed 19 February 2025.
