Google Declares It Will Scrape Entire Internet For AI ‘Learning’

  • July 5, 2023
Is Google asserting ownership of the entire Internet? It appears that way, but other AI companies are thinking in the same direction. If data exists anywhere, Technocrats believe they have a right to possess it. Data is the Technocrat’s heroin and like addicts, they will bluster, lie, cheat and steal to feed their habit. ⁃ TN Editor

Google updated its privacy policy over the weekend, explicitly saying the company reserves the right to scrape just about everything you post online to build its AI tools. If Google can read your words, assume they belong to the company now, and expect that they’re nesting somewhere in the bowels of a chatbot.

“Google uses information to improve our services and to develop new products, features and technologies that benefit our users and the public,” the new Google policy says. “For example, we use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”

Fortunately for history fans, Google maintains a history of changes to its terms of service. The new language amends an existing policy, spelling out new ways your online musings might be used in the tech giant's AI tools.

Previously, Google said the data would be used "for language models," rather than "AI models," and where the older policy mentioned only Google Translate, Bard and Cloud AI now make an appearance.

This is an unusual clause for a privacy policy. Typically, these policies describe ways that a business uses the information that you post on the company’s own services. Here, it seems Google reserves the right to harvest and harness data posted on any part of the public web, as if the whole internet is the company’s own AI playground. Google did not immediately respond to a request for comment.

The practice raises new and interesting privacy questions. People generally understand that public posts are public. But today, you need a new mental model of what it means to write something online. It's no longer a question of who can see the information, but how it could be used. There's a good chance that Bard and ChatGPT ingested your long-forgotten blog posts or 15-year-old restaurant reviews. As you read this, the chatbots could be regurgitating some homunculoid version of your words in ways that are impossible to predict and difficult to understand.

One of the less obvious complications of the post-ChatGPT world is the question of where data-hungry chatbots sourced their information. Companies including Google and OpenAI scraped vast portions of the internet to fuel their robot habits. It's not at all clear that this is legal, and the next few years will see the courts wrestle with copyright questions that would have seemed like science fiction a few years ago. In the meantime, the phenomenon already affects consumers in some unexpected ways.

The overlords at Twitter and Reddit feel particularly aggrieved about the AI issue, and made controversial changes to lock down their platforms. Both companies turned off free access to their APIs, which had allowed anyone who pleased to download large quantities of posts. Ostensibly, that's meant to protect the social media sites from other companies harvesting their intellectual property, but it's had other consequences.

Twitter and Reddit’s API changes broke third-party tools that many people used to access those sites. For a minute, it even seemed Twitter was going to force public entities such as weather, transit, and emergency services to pay if they wanted to Tweet, a move that the company walked back after a hailstorm of criticism.

Lately, web scraping is Elon Musk's favorite boogeyman. Musk blamed a number of recent Twitter disasters on the company's need to stop others from pulling data off his site, even when the issues seemed unrelated. Over the weekend, Twitter limited the number of tweets users were allowed to look at per day, rendering the service almost unusable. Musk said it was a necessary response to "data scraping" and "system manipulation." However, most IT experts agreed the rate limiting was more likely a crisis response to technical problems born of mismanagement, incompetence, or both. Twitter did not answer Gizmodo's questions on the subject.

Read full story here…
