Search engines that are no search engines
Last Updated: 2007-11-09 15:53:19 UTC
by Johannes Ullrich (Version: 1)
a.b.c.d - - [09/Nov/2007:15:24:35 +0000] "GET /portreportascii.html?date=2007-11-09 HTTP/1.0" 200 500572 "-" "gsa-crawler (Enterprise; S5-FTNF3BWZPUJAS; email@example.com)" "-"
At first, I thought "oh well, its google". But looking at the user agent string closer, reveals some subtle differences. This is a Google search appliance, not the uber-google-bot we all love. The regular Google bot looks like this:
126.96.36.199 - - [09/Nov/2007:15:24:37 +0000] "GET /date.html?port=47109&date=2007-10-25 HTTP/1.1" 200 7538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
I have seen similar cases a few times now. While this one was not malicious, in some cases attacks used google's (or other search engine) user agent strings. I can only assume that this is an attempt to fit in better, and maybe retrieve a search engine version of the page. If anybody knows a good reference where to find IP address ranges used by certain search engines: let us know.
(and btw... if you need bulk data access to dshield data: Please ask. Spidering the site is just not very efficient and you will run into some anti-harvesting traps sending you in circles)
Johannes B. Ullrich
Chief Research Officer, SANS Technology Institute