Search Engine Bot Verification Woes — Repost

Another article from the 2008 version of my web site continues to receive inbound traffic. I repost the article here in hopes that someone might still find it useful. Again, the information is at least four years old.

The Original Article

Along with many of you, I have read several articles about "proxy hacking" and other issues related to the need to verify the source of user agents purporting to be a known search engine spider. For those of you unfamiliar with such problems, "proxy hacking" involves routing a bot through a proxy server in order to get to your site. Page ranking associated with your site could then be transferred to the URL of the proxy server. Other issues involve browsers or bots masquerading as a known search engine spider for any purpose.

After reading some of these articles, I decided to do something to protect my site. What is there to fear? Proxy hacking can be a legitimate concern. As for simple masquerading: if you enter my establishment while flashing a fake I.D., it makes me think you’re up to something.

The means of identification most often cited as the most effective, even by the search engines themselves, is a two-step process. First, perform a reverse DNS lookup of the purported bot’s IP address. If the host name returned is not in the domain to which the bot belongs, then it has failed the test. If the host name is, indeed, in the domain associated with the bot, then perform a forward DNS lookup on that host name to confirm that it resolves to the IP address where the request originated.
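For readers who would like to see the two steps spelled out, here is a minimal sketch in Python. It is not the PHP script discussed later in this article. The function name verify_bot_ip and its parameters are made up for illustration, the lookups use the standard library socket module (so IPv4 only), and the resolver functions can be swapped out, which makes it possible to try the logic without touching real DNS.

```python
import socket


def _reverse(ip):
    """Reverse lookup: IP address -> host name, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def _forward(host):
    """Forward lookup: host name -> list of IPv4 addresses (possibly empty)."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def verify_bot_ip(ip, allowed_domains, reverse_lookup=None, forward_lookup=None):
    """Return True only if ip passes the reverse-then-forward DNS check."""
    # Fall back to real DNS lookups unless the caller supplies stand-ins.
    reverse_lookup = reverse_lookup or _reverse
    forward_lookup = forward_lookup or _forward

    host = reverse_lookup(ip)
    if host is None:
        return False  # no reverse record at all

    host = host.rstrip(".").lower()
    # Step 1: the host name must be in one of the expected domains.
    if not any(host == d or host.endswith("." + d) for d in allowed_domains):
        return False

    # Step 2: the host name must resolve back to the original address.
    return ip in forward_lookup(host)
```

A call such as verify_bot_ip(request_ip, ["crawl.yahoo.net"]) would then perform both lookups against live DNS, where request_ip stands for the address taken from the incoming request.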

This method has been described by Matt Cutts in an article entitled “How to verify Googlebot” on the Official Google Webmaster Central Blog. In “MSFT adds bot verification,” an article on his own blog, Matt also notes that MSN approves the same method for MSNbot and that Ask Jeeves does the same for its Teoma bot.

The Yahoo! Search Blog has an article entitled “Yahoo! Search Crawler, Slurp, has a new Address and Signature Card,” where they state:

Once you have the host name (in this case, lj612134.crawl.yahoo.net), you can then check if it really is coming from Yahoo! Search. The name of all Yahoo! Search crawlers will end with ‘crawl.yahoo.net,’ so if the name doesn’t end with this, you know it’s not really our crawler.

Ask.com has roughly the same advice on their “About Ask.com: Webmasters” page, where they state:

A User-Agent is no guarantee of authenticity as it is trivial for a malicious user to mimic the properties of the Ask Crawler. In order to properly authenticate the Ask Crawler, a round trip DNS lookup is required. This involves first taking the IP address of the Ask Crawler and performing a reverse DNS lookup ensuring that the IP address belongs to the ask.com domain. Then perform a forward DNS lookup with the host name ensuring that the resulting IP address matches the original.

MSN, not to be left out, has the following on their page “Live Search: Search robots in disguise”:

Once you have the host name (in this case, livebot-207-46-98-149.search.live.com), you can check that it really is coming from Live Search. The name of all live search crawlers will end with ‘search.live.com’. If the name doesn’t end with ‘search.live.com’, you know it’s not really our crawler.

Finally, you need to verify that the name is accurate. In order to do this, you can use Forward DNS to see the IP address associated with the host name. This should match the IP address you used in Step 2 — if it doesn’t, it means the name was fake.
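One detail in the "ends with" test quoted above is worth a small illustration. The sketch below is my own assumption about how such a check might be written, not anything published by the search engines; host_in_domain is a made-up name, and notask.com is a hypothetical host name. A plain string comparison accepts any name that merely ends with the same characters, so the comparison is done on a label boundary (a leading dot) instead.

```python
def host_in_domain(host, domain):
    """True if host is the domain itself or a name underneath it."""
    host = host.rstrip(".").lower()
    domain = domain.rstrip(".").lower()
    return host == domain or host.endswith("." + domain)


# Host names quoted by Yahoo! and MSN in their own articles.
print(host_in_domain("lj612134.crawl.yahoo.net", "crawl.yahoo.net"))
print(host_in_domain("livebot-207-46-98-149.search.live.com", "search.live.com"))

# A naive suffix test is fooled by a look-alike name; the boundary test is not.
print("notask.com".endswith("ask.com"), host_in_domain("notask.com", "ask.com"))
```

The first two lines print True, and the last line prints True False, showing that the naive test would have accepted the look-alike.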

The method is valid, and it works… most of the time. When does it not work? When the search engine company doesn’t have their DNS set up properly. “What’s that?” you say, “How can you accuse the likes of Google, Yahoo!, MSN, and Ask Jeeves of something like that?”

As I said above, I decided to do something to protect my site from fake bots. I made a nice little PHP script to perform the checks outlined by the search engine companies, and I put an include statement at the top of my site’s pages to run that piece of code. Bots that weren’t who they said they were received a 503 HTTP status code (service unavailable). Everyone else got the regular page.
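The decision described above can be sketched roughly as follows. This is a Python illustration and not the PHP include itself; the user-agent substrings in CLAIMED_BOTS are my assumptions, the domains are the ones quoted earlier from Yahoo!, MSN, and Ask.com, and verify is a hypothetical callable that performs the round-trip DNS check for a given address and domain. Only requests that claim to be a known bot are ever checked.

```python
# Assumed user-agent substrings mapped to the domains the companies published.
CLAIMED_BOTS = {
    "msnbot": "search.live.com",
    "slurp": "crawl.yahoo.net",
    "teoma": "ask.com",
}


def status_for(user_agent, ip, verify):
    """Pick the HTTP status code to send for one request.

    verify(ip, domain) is a caller-supplied (hypothetical) function that
    returns True when the address passes the round-trip DNS check.
    """
    ua = user_agent.lower()
    for token, domain in CLAIMED_BOTS.items():
        if token in ua:
            # Claims to be a known spider: it has to prove it.
            return 200 if verify(ip, domain) else 503
    # Makes no such claim: serve the regular page.
    return 200
```

Keeping the DNS check behind a function argument means the lookups only happen for the small share of requests that claim to be a spider, and it lets the logic be exercised with a stand-in during testing.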

I had set up the PHP script to keep a log of all requests that were sent the 503 status page. I tested the function using an online testing service that let me choose the user agent. Testing was also done using a user agent switcher in Firefox. The script seemed to function properly, and the results in the log file were as expected.

When I checked the log file, I also checked the IP addresses from the command line on my local computer, just to verify that things were working properly. I noticed some strange results. A request from a user agent claiming to be MSNbot came from the IP address 65.55.233.40 on 15 September. The reverse DNS lookup for that address returned the name bl1sch2041711.phx.gbl. Those of you familiar with the official TLDs will notice that .gbl is not a recognized top-level domain.

Just what is going on here? It appears that Microsoft, in their infinite wisdom, have the inverse-address settings of one or more blocks of IP addresses set to point to a domain name that does not exist for the outside world. Wait a minute, didn’t MSN just say, "Once you have the host name (in this case, livebot-207-46-98-149.search.live.com), you can check that it really is coming from Live Search. The name of all live search crawlers will end with ‘search.live.com’. If the name doesn’t end with ‘search.live.com’, you know it’s not really our crawler."? It seems that this isn’t correct in the real world.

Microsoft, however, is not the only group that does not live up to their own published standards.

The following day, a hit from 202.160.178.78 was entered in the log file. It came from Yahoo! Slurp China, and the address resolved to a host in the inktomisearch.com domain, a domain which Yahoo! claims to have stopped using for Slurp.

A few days later, a bot claiming to be Ask Jeeves/Teoma came from 65.214.36.73. This address resolved to a host in the directhit.com domain. This domain is owned by Ask Jeeves, but their bot is not supposed to be coming from that domain. As far as I know, they don’t even use it at all. I used the ask.com contact form to let them know about this one, and it has now been changed to the ask.com domain. They did, however, neglect to respond to my query in any other way, like maybe a note to say “Thanks, we missed that one.”

Another hit from Ask Jeeves came from 206.80.1.253, which resolves to g2spf.jeeves.ask.info. This is a legitimate domain owned by Ask Jeeves, but it still isn’t the domain they said Teoma would be coming from.

A hit from 65.55.241.213 claimed to be MSNbot. This IP address has no inverse address set at all. The IP block, however, is owned by Microsoft. The address 72.14.193.166 also has no inverse address set. This IP block belongs to Google. Once again, they just aren’t playing fair.
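The failures above fall into different categories, and it helps to log which one occurred rather than a bare pass or fail. The following Python sketch is only an illustration of that idea: classify and the category labels are my own invention, the reverse-lookup table holds the results reported in this article rather than live DNS data, and the forward lookup is a stub because none of these addresses got far enough to need one.

```python
def classify(ip, expected_domain, reverse_lookup, forward_lookup):
    """Say which step of the round-trip check an address fails, if any."""
    host = reverse_lookup(ip)
    if not host:
        return "no-reverse-record"
    host = host.rstrip(".").lower()
    if not (host == expected_domain or host.endswith("." + expected_domain)):
        return "unexpected-domain"
    if ip not in forward_lookup(host):
        return "forward-mismatch"
    return "verified"


# Reverse-lookup results as recorded in this article (not live DNS).
RECORDED = {
    "65.55.233.40": "bl1sch2041711.phx.gbl",
    "206.80.1.253": "g2spf.jeeves.ask.info",
    "65.55.241.213": None,
}

# Each address paired with the domain its claimed owner published.
CLAIMS = [
    ("65.55.233.40", "search.live.com"),
    ("206.80.1.253", "ask.com"),
    ("65.55.241.213", "search.live.com"),
]

for ip, domain in CLAIMS:
    print(ip, classify(ip, domain, RECORDED.get, lambda host: []))
```

With a log like this, a missing reverse record on an address block owned by the search engine can be told apart from a host name in an entirely unrelated domain.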

No, we don’t live in a perfect world. It would be nice, though, if people would go just that extra step further to make sure that mistakes which are pointed out to them are resolved (as Ask Jeeves apparently did). The phx.gbl issue has been around for years, and Microsoft (as far as I know) has no intention of making any changes in this regard.

So, what do I do now? I’ve made some changes in my script as I see more results in my log files. It’s getting better, and one of these days I will publish the PHP code here for the benefit of others. It would all be so much easier if the bot owners would keep their DNS practices up to par.

2 comments

    • Joe on 2014/06/29 at 01:27

    You could be classifying google/microsoft etc. workers as bots. How do you know it is a bot?

    1. Hi Joe! I set things up to only classify it as a bot if it first identifies itself as a bot in the user-agent string. The domain verification is to be certain it’s a bot from the company it’s claiming to represent.
