The poster's guide on who is selling your data for AI training

By Brian McDonnel, February 29, 2024

Photo Illustration by Rafael Henrique/SOPA Images/LightRocket via Getty Images

Those Tumblr, Reddit, and WordPress posts you never thought would see the light of day? Yep, them too.

If you’ve ever posted anything on the internet, chances are that your data has already been scraped, collected, and used to train AI systems like the ones powering ChatGPT, Midjourney, and Sora. Generative AI is designed to succeed as a generalist, and learning to do so, OpenAI has said, requires “internet-scale” data to train on.

You probably don’t need me to tell you what happened when companies used scraped public data — often without the permission of those who created it — from news articles, books, and creative projects to teach AI tools how to, say, generate news articles, books, and creative projects.

The New York Times is currently suing OpenAI for allegedly using its expansive archives without permission to train chatbots (in a recent filing, OpenAI accused the Times of hiring “someone to hack” ChatGPT in order to prove that the chatbot was stealing its content). Getty Images sued Stability AI, the company behind Stable Diffusion, for copyright infringement. Other lawsuits from authors and creators, angry to find that their works were used to train AI models, have faced setbacks in court.

Other companies have decided to make deals. The Associated Press has licensed part of its archives to OpenAI. Shutterstock, the stock photo archive, has signed a six-year deal with OpenAI to provide training data, which includes access to its photo, video, and music databases.

The ways AI systems use the work of journalists, musicians, and photographers have pretty consequential implications for our information and cultural ecosystem and for the people who work in the fields that AI companies seem dead-set on developing tools to replace. The need to gather more and more training data with as little fuss as possible means that anyone who’s an online poster, whether it’s a fandom Tumblr account, an active Reddit presence, or a personal blog, could see access to their content being sold by the platforms hosting it to one of these big AI companies.

Below is a quick guide to what we know right now about who might be selling your best posts as training data.

Tumblr and WordPress.com

Earlier this week, 404 Media reported that Automattic, the parent company of Tumblr and WordPress.com, was preparing to announce deals selling user data to OpenAI and Midjourney. According to 404’s reporting, which describes such a deal as “imminent,” the data seems likely to include user posts on Tumblr and on WordPress.com. On Wednesday, a day after 404’s report, Automattic announced a way for users to opt out of sharing their public content with third parties.

The Tumblr staff announcement on the change framed the whole thing as a sign that the company was working to protect its users. “We already discourage AI crawlers from gathering content from Tumblr and will continue to do so,” the announcement read, “save for those with which we partner.”

Automattic said in a statement that it was “working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control,” but has not provided any further information on the reported deals with OpenAI and Midjourney.

Although Tumblr’s cultural heft has waned over the past decade, it’s still a pretty important platform for fandom content, including fanfiction and fan art. There are also plenty of artists who use Tumblr to host their original work and take commissions.

Reddit

Reddit’s enormous archives of posts are driven by the labor of volunteers: Unpaid subreddit moderators oversee communities of unpaid users. Their collective efforts on Reddit make the platform valuable.

So when Reddit announced that it was launching an IPO, the company reached out to a selection of mods and frequent posters to offer them the opportunity to buy stock early. Some of those who received the offer were not super enthusiastic about it. But Reddit does not need buy-in from its users to profit from their work: It has already sold access to their posts to Google.

Just before the IPO announcement, Reddit and Google entered into a $60 million deal that would give Google access to Reddit’s API in order to, among other things, train its generative AI models.

Everything else, to be honest

The reported deals above are just a couple that have become public. But this doesn’t mean that large AI models aren’t already being trained on your posts across the internet.

Last year, the Washington Post examined one of the massive data sets of scraped public internet data used to train generative AI models and found everything from World of Warcraft message boards to Patreon and Kickstarter and several huge repositories of personal blogs. And it should not be a surprise that Meta uses public posts from Facebook and Instagram to train its AI models.

----------------------------------------

By: A.W. Ohlheiser
Title: A poster’s guide to who’s selling your data to train AI
Sourced From: www.vox.com/technology/24086039/reddit-tumblr-wordpress-whos-selling-your-data-to-train-ai
Published Date: Thu, 29 Feb 2024 12:00:00 +0000
