TLDR:
- AI language models like ChatGPT could exhaust publicly available training data sometime between 2026 and 2032, creating a potential bottleneck in AI development for companies such as OpenAI, Google, and Meta.
- As demand for training data grows, companies may resort to tapping private data sources or relying on less reliable synthetic data generated by AI systems themselves.
Artificial intelligence systems like ChatGPT could soon hit a development bottleneck as sources of publicly available training data are exhausted, a new study by research group Epoch AI suggests. The study projects that there may not be enough new human-generated text to sustain the current pace of AI development beyond the turn of the decade.
In the short term, companies like OpenAI and Google are racing to secure high-quality data sources to feed their large language models, tapping outlets such as Reddit forums and news media. In the longer term, however, there may not be enough new blogs, news articles, and social media commentary to sustain the current trajectory of AI development.
The study highlights a potential bottleneck in which companies may have to resort to sensitive private data or to synthetic data created by AI systems themselves. The concern is that once companies hit the limit of available data, scaling up models, the main lever for improving their capabilities and output quality, becomes far less effective.
Some researchers suggest that training specialized AI models for specific tasks could be an alternative to constantly scaling up models with ever-larger datasets. However, repeatedly training AI systems on the same data sources risks degraded performance and can encode existing mistakes and biases back into the information ecosystem.
The study thus points to a looming challenge for AI development: companies may soon face a scarcity of high-quality text for training their systems. That scarcity could prompt a shift toward private data sources or synthetic data, posing risks to the quality and reliability of future AI models.