
The Conversation | Google DeepMind/Unsplash
OpenAI and many other developers refuse to provide precise details about the training data for their models. Research efforts to reverse engineer some of these datasets have also been stymied by copyright takedowns.

When errors are found, there is no easy fix. Simple keyword filtering could catch specific terms such as "vegetative electron microscopy", but it would also eliminate legitimate references to them (such as this article).

More fundamentally, the case raises an unsettling question: how many other nonsensical terms exist in AI systems, waiting to be discovered?