November 10, 2003

Google spyware

I just had an interesting experience, which I'm not 100% thrilled with.

Over on IRC, I made a casual reference to a URL. http://www.sidhe.org/backgrounds/gif_for_folks_who_link_unasked.gif specifically. It's a GIF I swap in for the real image when I find folks linking to images in the background collection. (For the record, it's the background image from the nifty.org archive of gay/lesbian/bi fiction (though work-safe) and has the twin advantages of being very light (so it tends to obscure white text) and subtly subversive, while still being tasteful)

It's not that I care that people look at the backgrounds, nor that they grab them for their own use. I found them ages ago in a free collection of web graphics and threw them up there myself for my own use, and I'm certainly not going to complain when other folks grab them and use them--go for it and good luck. I'd just rather you use the copy on your server, not mine. :)

Anyway, I made an offhand comment on IRC with the URL for the real image (I hardlink it on the server, rather than playing Apache redirect games) and three minutes later... Pow! There was googlebot, looking at it. Turns out that one of the folks on the channel'd looked at it, and they run Opera with the google stuff on it so presumably that's how google got the URL. I will admit to being very impressed with the speed that the crawler struck out with (and it was the crawler, at least according to the log data and the PTR record for the IP address) but still...
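
For anyone who wants to repeat the PTR check mentioned above, here is a minimal sketch of a reverse-then-forward DNS confirmation. The function names and the list of hostname suffixes are assumptions made for illustration, not something taken from the server in question, so adjust the suffixes to whatever Google currently documents for its crawlers. The two lookup functions are passed in as parameters so the logic can be exercised without touching the network.

```python
import socket

# Assumed suffixes for Google crawler hostnames (illustrative, not authoritative).
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def looks_like_google_host(hostname):
    """Return True if the PTR hostname sits under one of the assumed suffixes."""
    host = hostname.rstrip(".").lower()
    return host.endswith(GOOGLE_CRAWLER_SUFFIXES)


def verify_google_crawler(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Check the PTR record for ip, then confirm the name resolves back to ip."""
    try:
        hostname, _aliases, _addrs = reverse_lookup(ip)
    except OSError:
        # No PTR record, or the lookup failed.
        return False
    if not looks_like_google_host(hostname):
        return False
    try:
        _name, _aliases, addresses = forward_lookup(hostname)
    except OSError:
        return False
    # A PTR record alone can be forged by whoever controls the reverse zone,
    # so the forward lookup has to agree with the original address.
    return ip in addresses
```

Calling verify_google_crawler with an address pulled from the access log does both lookups with the standard library resolver. The forward confirmation matters because a PTR record by itself only proves what the owner of the IP block chose to publish.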

Makes me wonder how much interesting info Google's got indexed that's on public webservers but not linked to, because someone's doing work with a browser that's got the google plugins. Yet another argument for not putting private data on public webservers, no matter how obscure the URL is. (As if we really needed another)

Also makes me wonder how long before Google starts snagging the contents as seen in your browser in case the pages it tries to crawl aren't accessible...

Posted by Dan at November 10, 2003 01:07 PM
Comments

Google is apparently doing some direct indexing of IRC:

http://manero.org/weblog/archives/000133.html

(Look for the message from the Google Team.)

Posted by: Michael S. at November 10, 2003 06:41 PM

I was somewhat disconcerted when the Apache::MP3 server that I'd stuck up on my home box so I could listen to stuff at work got indexed by Google. Which will teach me for mentioning it to someone on IRC I guess.

Posted by: Piers Cawley at November 11, 2003 05:45 AM

This might be of interest, the guy works for Google.

http://www.richardm.co.uk/blog/?id=8

Posted by: Frank at November 11, 2003 06:17 AM

Sheesh. I hear things like this all the time: "I signed up for AdWords and 10 minutes later, Googlebot crawled me!" The fact is that it would be more of a coincidence if Googlebot never crawled a page right after it got mentioned in an email newsletter, or on IRC, or on CNN, or whatever; so color me extremely skeptical.

I wish more bloggers were savvy webmasters. For example, why use the script on richardm's page when you can just make a small change to your robots.txt?

User-agent: Googlebot
Disallow: /*.gif$

I found that example directly on Google's webmaster info pages. Anyway, I'd be more convinced if you repeated the experiment at least once. Make a new image with a weird name that has absolutely no links to it. Then post the url in the IRC channel again, and tell us what happens. :)

Posted by: PaulR at November 11, 2003 01:22 PM
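
To see which paths the robots.txt rule in the comment above would cover, here is a small sketch of the wildcard matching it relies on: `*` matches any run of characters and a trailing `$` anchors the end of the path. The helper names are hypothetical, and this is only an approximation of the documented pattern syntax, not Googlebot's actual matcher.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a wildcard-style Disallow pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without a trailing '$' the pattern behaves as a plain prefix match.
    return re.compile(body + (r"\Z" if anchored else ""))


def is_disallowed(path, pattern):
    """Return True if the URL path is covered by the Disallow pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None
```

With the pattern `/*.gif$`, a path such as /backgrounds/example.gif is covered, while /backgrounds/example.gif?x=1 and /index.html are not, because the `$` requires the path to end in .gif.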

I certainly don't care that google's crawling and indexing the images here -- I'm more than capable of tweaking a robots.txt file if I need to, and I'm pretty comfortable with how crawlers work. (I didn't work at a search engine for two years without learning something... :)

The image was definitely new and definitely only mentioned on-channel, at least until I put it up in the blog. Mentioning other new, unlinked image URLs in-channel hasn't produced the same result when viewed with non-googlebar-enabled browsers.

The timing is... interesting. Given what's lurking on the network and who looked at the URL before google, either there's a hidden server on the rhizomatic IRC net that's linked into Google or there's spyware feeding URLs to Google's crawler in the googlebar plugin for Opera.

Posted by: Dan at November 11, 2003 01:53 PM

If I understand your post, you conclude that only one of two things could have happened: either a hidden server on IRC, or spyware in Opera. #1 strikes me as pretty implausible; guess it's possible. As far as #2, it seems like Opera is pretty straightforward about how exactly their stuff works:
http://www.opera.com/privacy/ads/
http://www.opera.com/docs/ads/
"All of the code in the Opera browser was written by the developers of Opera Software ASA. This includes the code written to implement the specific banner-serving functionality in the browser. Much care was taken with this development so that user privacy and security would not be compromised."

So I'm inclined toward alternative #3: someone did a link to that url. If it's a GIF that you swap in on people that steal your images, I wouldn't be surprised if someone noticed and linked to it directly--seems like your backgrounds page is pretty popular. I'm too lazy to install Opera or get on IRC and run the experiments myself, so I'll have to take your word for it. :)

Posted by: PaulR at November 11, 2003 02:17 PM

Or... it could be #2. :)

I put up a brand-spanking-new image, completely unlinked to anything. Downloaded Opera for Linux, fired it up, told it that I was OK with google, and looked at the new URL. Then I shut down Opera.

5 minutes 38 seconds later... here comes googlebot!

Posted by: Dan at November 11, 2003 02:22 PM

As I see folks have linked to this post, I'll point you at a later one with more details.

Posted by: Dan at November 20, 2003 05:36 PM

I downloaded Opera Browser for windows and allowed it to communicate with google.
I then navigated to a brand new site, then checked logs and found a hit from "Mediapartners-Google/2.1"

When you allow Opera to communicate with Google, you do so to allow Google to place text ads (AdWords) at the top of the browser window. Opera generates income from anyone that clicks on these ads through the AdSense program. Mediapartners-Google/2.1 is the AdSense spider used to check a web page and then return relevant ads. You should notice the ads change each time you visit a new page.

Posted by: Splosh at November 23, 2003 02:00 PM
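
The distinction drawn in the comment above, between the AdSense spider and the regular indexing crawler, can be checked in a server log by looking at the User-Agent string. The sketch below is a rough classifier; the function name and the returned labels are made up for illustration, and the substring tests assume the agent strings look like the "Mediapartners-Google/2.1" hit quoted above.

```python
def classify_google_agent(user_agent):
    """Label a User-Agent string as the AdSense spider, the indexer, or other."""
    ua = user_agent.lower()
    # Check the AdSense spider first so its string is never mistaken
    # for the indexing crawler.
    if "mediapartners-google" in ua:
        return "adsense"
    if "googlebot" in ua:
        return "indexer"
    return "other"
```

A User-Agent header is trivially spoofed, so this only says what the client claimed to be. It is a quick way to separate the two kinds of hits before doing any DNS verification on the source address.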

I recently built an extranet for a large organisation.
Imagine my bewilderment when a month later googlebot had indexed it. While the main page is not meant to be private, I certainly didn't expect Google to discover it. Yep, the Google toolbar was the culprit and had submitted the extranet to Google.
A small reconfiguration of IIS and all googlebot / google IP addresses are now denied access to the extranet.

Problem solved......

still a bit of a sneaky way to pick up data (Spyware???!!?)

Posted by: F1LBY at November 27, 2003 09:50 AM

I have the google tool bar for internet explorer / win XP. Actually I turned on the page rank option, which sends anonymous (or should be) information on the site I'm viewing to give me the page rank. I also have a site, and added it to google ... I visited my site many many many times, but google still hasn't. Do you think I should try with opera!? :D

Posted by: Sym at December 2, 2003 08:15 PM

Sounds just like the Alexa toolbar: if you use it, it shows your every move in the Alexa searches. Even if you go to a server to update your web page, it shows that you went there. We cover this subject at http://www.searchwars.squarespace.com if anyone has any further interest.

Posted by: Anthony Cea at May 17, 2004 11:28 PM

I enjoy posting months after a topic has been labelled dead.

Posted by: Anthony Cea at August 1, 2004 06:20 PM

What you saw was a different Googlebot to the usual indexing one. AdSense, included in the free version of Opera, needs to visit a page to retrieve content in order to work out what is on the page. It doesn't, as yet, pass URLs to the main Google indexing engine. This bot has a different User Agent string (I believe including the words "Media Partners" or something similar).

Posted by: ILoveJackDaniels at August 11, 2004 06:36 AM