ChatGPT stole your work. So what are you going to do?

If you have ever uploaded photos or art, written a review, “liked” content, answered a question on Reddit, contributed to open source code, or performed any number of other activities online, you have done free work for tech companies, because downloading all that content from the web is how their AI systems learn about the world.

Tech companies know this, but they mask your contributions to their products with technical terms like “training data”, “unsupervised learning” and “data exhaust” (and, of course, impenetrable “Terms of Use” documents). In fact, much of the AI innovation in recent years has been about finding ways to use more and more of your content for free. This is true for search engines like Google, social media sites like Instagram, AI research startups like OpenAI, and many other providers of intelligent technologies.

This exploitative dynamic is particularly damaging when it comes to the new wave of generative AI programs such as Dall-E and ChatGPT. Without your content, ChatGPT and all its ilk would simply not exist. Many AI researchers feel that your content is actually more important than what computer scientists are doing. Yet these intelligent technologies that exploit your work are the same technologies that threaten to put you out of a job. It’s as if an AI system walked into your factory and stole your machine.

But that dynamic also means that the users who generate data have a lot of power. Discussions about the use of sophisticated AI technologies often come from a place of powerlessness, with the stance that AI companies will do what they want and there is little the public can do to steer the technology in a different direction. We are AI researchers, and our research suggests that the public has an enormous amount of “data leverage” that can be used to create an AI ecosystem that both generates amazing new technologies and shares the benefits of those technologies fairly with the people who helped create them.

Data leverage can be exercised through at least four routes: direct action (for example, individuals banding together to withhold, “poison” or redirect data), regulatory action (for example, pushing for data protection policies and legal recognition of “data coalitions”), legal action (for example, communities adopting new data licensing regimes or filing lawsuits) and market action (for example, demanding that large language models be trained only on data from consenting creators).

Let’s start with direct action, which is a particularly exciting path because it can be taken right away. Because generative AI systems rely on scraping the web, website owners can significantly disrupt the training data pipeline if they prohibit or limit scraping by configuring their robots.txt file (a file that tells web crawlers which pages are off limits).
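As a rough illustration of the robots.txt approach, the sketch below uses Python’s standard `urllib.robotparser` module to check what a given set of rules would permit. The crawler names `ExampleAIBot` and `SomeSearchBot`, the example.com URL and the helper `is_allowed` are all hypothetical; a site owner would need to look up the actual user-agent string that each AI crawler publishes. Keep in mind that robots.txt is a voluntary convention, so it only stops crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one AI crawler entirely while
# leaving the site open to every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/gallery/painting.html"
    for agent in ("ExampleAIBot", "SomeSearchBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, page) else "blocked"
        print(f"{agent}: {verdict}")
```

Running a check like this before publishing the file is a cheap way to confirm that the rules block the intended crawler without accidentally shutting out ordinary search engines.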

Large user-generated content sites such as Wikipedia, StackOverflow and Reddit are particularly important for generative AI systems, and they can prevent these systems from accessing their content in even stronger ways, for example by blocking IP traffic and API access. According to Elon Musk, Twitter recently did just that. Content producers should also take advantage of the opt-out mechanisms increasingly provided by AI companies. For example, programmers on GitHub can opt out of BigCode’s training data through a simple form. More generally, simply being vocal when content has been used without your consent has been somewhat effective. For example, leading generative AI player Stability AI agreed to honor opt-out requests following an uproar on social media. By engaging in public forms of action, as in the case of the mass protest against AI art by artists, it may be possible to force companies to cease commercial activities that most of the public perceives as theft.