TLDR:
– OpenAI previously claimed it was impossible to train leading AI models without using copyrighted materials
– New evidence suggests that large language models can be trained ethically without copyrighted content
The article discusses recent developments that challenge the belief that training AI models requires copyrighted data. OpenAI and other leading AI companies have relied on copyrighted materials to train their models, which has led to lawsuits alleging copyright infringement. However, a group of researchers backed by the French government has released a large AI training dataset composed entirely of public domain text, demonstrating that it is possible to train AI models without using copyrighted materials. Additionally, Fairly Trained, a non-profit organization, has certified its first large language model, KL3M, which was developed by a legal tech consultancy startup using a curated dataset of legal, financial, and regulatory documents that complies with copyright law.
The article highlights the importance of training AI models ethically, on data that companies own, have licensed, or that is in the public domain. By curating datasets and avoiding unlicensed copyrighted materials, companies can create specialized and effective AI models without the risk of copyright infringement lawsuits. This approach not only supports ethical practices in AI development but also offers a selling point to users, who are increasingly seeking AI models that were built fairly and legally.