As we all know, deep learning needs training data, tons of data. But how much training data did OpenAI actually need for ChatGPT? Well: more than was directly available!
It turns out that during the data collection phase OpenAI exhausted the internet, meaning all the English text written down on the web was not enough.
How did they solve it? They created Whisper, an audio-to-text speech recognition tool. Then they scraped roughly one million hours of English YouTube videos and fed the transcripts into their training pipeline. Whether this was compliant with Google's Terms of Service at the time is debatable. Google internally used a similar approach for its own LLM, which may violate the copyright of the content creators.
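To give a rough idea of what such a transcription step looks like, here is a minimal sketch using the open-source whisper Python package that OpenAI released. The model size and the file name "video_audio.mp3" are placeholders, not details from the reporting; the actual pipeline OpenAI ran at scale is not public.

```python
# Minimal sketch: turning one audio file into training text with Whisper.
# Assumes `pip install openai-whisper` and ffmpeg installed on the system.
import whisper

# Load a pretrained model ("base" is small; larger variants are more accurate).
model = whisper.load_model("base")

# Transcribe a (hypothetical) audio file extracted from a video.
result = model.transcribe("video_audio.mp3")

# The "text" field is the transcript that could then be added to a text corpus.
print(result["text"])
```

At the scale described, this kind of step would be run over many machines across a huge batch of audio files, but the core operation per file is this simple.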
This also invites all kinds of legal trouble; and future model generations are expected to need even more data, so companies are already considering transcribing radio, podcasts, and so on.
"The race to lead AI has become a desperate hunt for the digital data needed to advance the technology. To obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law, according to an examination by The New York Times."
(via economictimes.indiatimes.com)