
Can AI Be Trained With Copyrighted Material?

February 1, 2024

By François Larose and William Audet

A U.S. jury will soon be asked to decide whether the unauthorized use of copyright-protected works to train a machine learning model can constitute fair use (the U.S. equivalent of the fair dealing defence in Canada).

The District Court for the District of Delaware recently denied a motion for summary judgment in a copyright infringement action brought by Thomson Reuters, the owner of the Westlaw legal-research platform, against the startup Ross Intelligence, and ruled that the core issues of the case will need to be decided by a jury.

Thomson Reuters initiated its lawsuit in 2020, accusing Ross Intelligence of copying Westlaw’s headnotes and Key Number System to build a competing AI-based legal search tool. After being unable to secure a licence from Westlaw to use its headnotes (short summaries of points of law that appear in court opinions) to train its machine learning model, Ross Intelligence turned to a third-party legal research company to create memos with legal questions and answers, which were prepared both manually and, at times, using text-scraping bots. These materials were then converted into usable machine learning training data. The “questions” that Ross asked the third-party company to create were meant to be “those a lawyer would ask”, and the answers were direct quotations from court opinions. The idea behind Ross’ natural language search engine was a platform where users would enter a legal question and receive quotations from judicial opinions in return.

The core of the suit is the allegation that this third-party company copied Westlaw’s headnotes to prepare the materials on which Ross relied to train its machine learning model.

Among the many questions a jury will need to decide in relation to Ross’ fair use defence is whether the alleged reproduction of Thomson Reuters’ works was a form of “intermediate copying” and was thus lawful. Intermediate copying involves copying materials to discover unprotectable information or as a minor step towards developing an entirely new product, with the final output generated by the AI being transformative despite using copied material as an input.

Ross argued that its machine learning model studied the headnotes and opinion quotes only to analyze language patterns, and not to replicate the original expression of Westlaw’s headnotes prepared by its attorney-editors. In contrast, Thomson Reuters argued that Ross used the untransformed text of the headnotes to get its AI to replicate the creative drafting done by Westlaw’s attorney-editors.

We eagerly await the outcome of the upcoming jury trial, which will be one of the first to examine the use of copyright-protected works in the training of AI systems.

As AI development intensifies, questions surrounding data mining and the use of third-party copyright-protected materials as training data, including for generative machine learning systems, are becoming increasingly prominent and raise important considerations about intellectual property rights.

A growing number of our clients are reaching out to discuss these key questions and to better understand the Canadian copyright law landscape regarding artificial intelligence systems, AI-generated content, data mining, and the use of third-party materials as training data. While complex, these questions are increasingly important in a rapidly evolving technological landscape.
