According to The New York Times, OpenAI developed the Whisper audio transcription model and transcribed over a million hours of YouTube video to provide high-quality training materials for the GPT-4 model.
The company reportedly knew that such actions were legally questionable and fell into a copyright gray area, but it considered this a fair use of the materials. OpenAI President Greg Brockman was personally involved in collecting the videos that were used.
OpenAI ran out of useful data in 2021 and, after going through other resources, discussed transcribing YouTube videos, podcasts, and audiobooks. By then, the company had been training its models on data that included computer code from GitHub, databases of chess moves, and schoolwork content from Quizlet.
OpenAI spokeswoman Lindsay Held said the company curates “unique” datasets for each of its models to “help them understand the world” and stay competitive in global research. In doing so, the company uses “multiple sources, including public data and partnerships for non-public data,” and it is looking to generate its own synthetic data.
Google spokesman Matt Bryant said the company has “seen anecdotal reports” about OpenAI's activities, adding that “both our robots.txt files and Terms of Service prohibit the unauthorized copying or downloading of YouTube content.”
Recently, YouTube CEO Neal Mohan said that using platform data to train an OpenAI model is a violation of the terms of use. Google takes “technical and legal measures” to prevent such unauthorized use “if we have a clear legal or technical basis to do so.”
According to Times sources, Google also collected transcripts from YouTube. Matt Bryant said the company trained its models “on some YouTube content in accordance with our agreements with YouTube creators.”
Meta also faced limits on the availability of good training data, and its AI team discussed unauthorized use of copyrighted works to catch up with OpenAI. After going through nearly all of the accessible English-language books, essays, poems and news articles online, the company considered steps such as paying for book licenses or even purchasing a major publisher outright. Meta has also been limited in the ways it can use user data because of privacy-focused changes it made in the wake of the Cambridge Analytica scandal.