OpenAI transcribed over a million hours of YouTube videos for GPT-4 training

by alex April 8, 2024

The Technology section is published with the support of Favbet Tech

According to The New York Times, OpenAI developed the Whisper audio transcription model and transcribed over a million hours of YouTube video to provide high-quality training materials for the GPT-4 model.

The company reportedly knew that such actions were legally questionable and fell into a copyright gray area, but it considers this a fair use of the materials. OpenAI President Greg Brockman was personally involved in collecting the videos that were used.

OpenAI ran out of useful data in 2021 and, after exhausting other resources, discussed transcribing YouTube videos, podcasts, and audiobooks. By then, the company was training its models on data that included computer code from GitHub, databases of chess moves, and schoolwork content from Quizlet.

OpenAI spokeswoman Lindsay Held said the company curates “unique” datasets for each of its models to “help them understand the world” and stay competitive in global research. In doing so, the company uses “multiple sources, including public data and partnerships for non-public data,” and it is looking to generate its own synthetic data.

Google spokesman Matt Bryant said the company has “seen anecdotal reports” about OpenAI's activities, adding that “both our robots.txt files and Terms of Service prohibit the unauthorized copying or downloading of YouTube content.”

YouTube CEO Neal Mohan recently said that using the platform's data to train an OpenAI model would be a violation of its terms of use. Google says it takes “technical and legal measures” to prevent such unauthorized use “if we have a clear legal or technical basis to do so.”

According to Times sources, Google also collected transcripts from YouTube. Matt Bryant said the company trained its models “on some YouTube content in accordance with our agreements with YouTube creators.”

Meta also ran short of good training data, and its AI team discussed the unauthorized use of copyrighted works to catch up with OpenAI. After working through nearly all of the accessible English-language books, essays, poems and news articles online, the company considered steps such as paying for book licenses or even purchasing a major publisher outright. It has also been limited in how it can use user data because of privacy-focused changes it made in the wake of the Cambridge Analytica scandal.
