The concept of AI web scraping entered the public consciousness very recently, even by tech-world standards. It happened somewhere between the Cambridge Analytica scandal in 2018 and ChatGPT going mainstream in 2022.
Since then, we have learned one simple truth: AI must get its knowledge from somewhere. It must be trained. It must absorb large quantities of information. And what is the easiest way to obtain information?
That’s right, by collecting data from the internet, or “scraping the web”.
As a pro-privacy tech company, we have been keeping a very close eye on this problem. Here is what we have to say.
AI: The Fastest Reader on the Web
The main problem with web scraping is that AI can scrape anything. It could be Wikipedia, or it could be your tax records, your voice, or your facial data. A few examples:
- In 2019, IBM was reported to have used nearly a million Flickr photos to train facial recognition without the knowledge of the people pictured
- Clearview AI has amassed more than 20 billion (!) photos from the web, including social media, and reportedly sought every US mugshot from the past 15 years
- Midjourney scraped ‘a hundred million’ artworks without the artists’ permission
- Stability AI, the company behind Stable Diffusion, is being sued by Getty Images for allegedly scraping millions of its images
- Around 2014, Cambridge Analytica harvested data from tens of millions of Facebook profiles (initially reported as 50 million, later estimated by Facebook at up to 87 million) to profile users and influence the 2016 US presidential election
And the list goes on.
The thing is, most AI scraping goes unnoticed, and those behind it are rarely held accountable. Of course, the law protects your personal data to an extent, but you are unlikely to ever find out you’ve been “scraped”.
Wait, So AI = Evil, Right?
AI web scraping is not a bad thing in itself, quite the opposite. You benefit from scraping too, whenever a comparison site finds you the best deal in your area or the cheapest flight.
When done ethically, automated web scraping is a great way to improve a neural network, which can help many people in the future.
However, anyone can write a piece of code and scrape the web. It’s fast. It’s simple. It’s our new reality.
There are 15-minute tutorials on the subject.
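To show how little code this takes, here is a minimal sketch in Python that pulls every image address out of a web page using only the standard library. The page content, the example.com addresses, and the names ImageLinkCollector and collect_image_urls are all made up for illustration; a real scraper would download pages instead of reading a hard-coded string, and should respect each site's terms of use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkCollector(HTMLParser):
    """Collects the address of every <img> tag found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Turn relative paths into full addresses.
                self.image_urls.append(urljoin(self.base_url, src))


# Hypothetical page content; a real scraper would download this instead.
SAMPLE_PAGE = """
<html><body>
  <h1>My profile</h1>
  <img src="/photos/me.jpg" alt="profile photo">
  <img src="https://cdn.example.com/holiday.png">
  <p>No image here.</p>
</body></html>
"""


def collect_image_urls(html, base_url):
    """Return the full address of every image referenced in the HTML."""
    collector = ImageLinkCollector(base_url)
    collector.feed(html)
    return collector.image_urls


found = collect_image_urls(SAMPLE_PAGE, "https://social.example.com/user/42")
print(found)
```

Roughly thirty lines are enough to list every photo on a public page, which is why anything you post publicly should be treated as collectable.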
Your Facebook photo can be used by a compsci student for their semester project – or by a scammer who wants to construct a deepfake.
Let’s take voice data. It can be obtained from videos and sound files. You need around a minute of good-quality audio to mimic a voice. As the technology improves, a few seconds might be enough.
In the wrong hands, your cloned voice can be used to run a spoofing scheme, steal your identity, ruin your reputation, identify you on the internet, or simply play pranks. None of these options sounds good.
Even With AI Web Scraping, Privacy Is Achievable
The brute-force solution is: don’t put your voice or face on the internet. Ever.
But… come on.
Going “stealth” is rarely feasible. To many users, getting scraped by AI would be preferable to deleting their photos. Most people need their online presence, whether for work, business, or personal reasons.
A better strategy is to keep your sensitive data off the public internet. AI might record your YouTube travel blog, but it won’t get your travel schedule if that is encrypted in a secure note-taking app.
As a company, we have invested years into making wearables that can record sensitive data. We have gone to great lengths to protect our users. Data scraping has been another important challenge to solve. Luckily, our team is very good at encryption. Company policy is another crucial part of the effort.
- secure software
- secure hardware
- appropriate policy
These are the key elements that help keep you safe from data scraping. Choose which apps to trust based on their encryption and on how transparent their policies are.
Web scraping is not the end of the world. It can be controlled, harnessed, and avoided. In the end, it’s not really AI that’s collecting data but the humans behind the machine.
This article has been brought to you by Senstone. Our voice-to-text recorder is a leading solution on the productivity market. Stay secure – and stay awesome!