Dexterity Unlimited presents articles about web design and web site architecture.


Archive for the ‘Search Engines’ Category

Search Engine Bot Verification Woes

Tuesday, November 6th, 2007

Along with many of you, I have read several articles about "proxy hacking" and other issues related to the need to verify the source of user agents purporting to be a known search engine spider. For those of you unfamiliar with such problems, "proxy hacking" involves routing a bot through a proxy server in order to get to your site. Page ranking associated with your site could then be transferred to the URL of the proxy server. Other issues involve browsers or bots masquerading as a known search engine spider for any purpose.

After reading some of these articles, I decided to do something to protect my site. What is there to fear? Proxy hacking can be a legitimate concern. As for the matter of simple masquerading: if you enter my establishment while flashing a fake I.D., it makes me think you’re up to something.

The means of identification most often cited as the most effective, even by the search engines themselves, is a two-step process. First, perform a reverse DNS lookup of the purported bot’s IP address. If the host name returned is not in the domain to which the bot belongs, then it has failed the test. If the host name is, indeed, in the domain associated with the bot, then perform a forward DNS lookup on that host name to determine whether it actually resolves to the IP address where the request originated.

This method has been described by Matt Cutts in an article entitled “How to verify Googlebot” on the Official Google Webmaster Central Blog. In “MSFT adds bot verification,” an article on his own blog, Matt also notes that MSN approves the same method for MSNbot, and that Ask Jeeves does the same for its Teoma bot.

The Yahoo! Search Blog has an article entitled “Yahoo! Search Crawler, Slurp, has a new Address and Signature Card” where they state:

Once you have the host name (in this case, lj612134.crawl.yahoo.net), you can then check if it really is coming from Yahoo! Search. The name of all Yahoo! Search crawlers will end with ‘crawl.yahoo.net,’ so if the name doesn’t end with this, you know it’s not really our crawler.

Ask.com has roughly the same advice on their About Ask.com: Webmasters page, where they state:

A User-Agent is no guarantee of authenticity as it is trivial for a malicious user to mimic the properties of the Ask Crawler. In order to properly authenticate the Ask Crawler, a round trip DNS lookup is required. This involves first taking the IP address of the Ask Crawler and performing a reverse DNS lookup ensuring that the IP address belongs to the ask.com domain. Then perform a forward DNS lookup with the host name ensuring that the resulting IP address matches the original.

MSN, not to be left out, has the following on their page “Live Search: Search robots in disguise”:

Once you have the host name (in this case, livebot-207-46-98-149.search.live.com), you can check that it really is coming from Live Search. The name of all live search crawlers will end with ‘search.live.com’. If the name doesn’t end with ‘search.live.com’, you know it’s not really our crawler.

Finally, you need to verify that the name is accurate. In order to do this, you can use Forward DNS to see the IP address associated with the host name. This should match the IP address you used in Step 2 – if it doesn’t, it means the name was fake.
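The round-trip check described in these quotes can be sketched in a few lines of Python. This is only a minimal illustration, not the script discussed in this article: the function names (`reverse_lookup`, `forward_lookup`, `is_verified_bot`) are made up for the example, and the forward lookup handles IPv4 addresses only. The two lookups are passed in as parameters so the logic can be tried out without touching the network.

```python
import socket


def reverse_lookup(ip):
    """Return the host name from the reverse DNS record for ip, or None."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return None


def forward_lookup(host):
    """Return the list of IPv4 addresses for host (empty if the lookup fails)."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def is_verified_bot(ip, expected_domain,
                    reverse=reverse_lookup, forward=forward_lookup):
    """Hypothetical helper: True only if ip passes both DNS steps."""
    # Step 1: reverse lookup; the name must lie inside the expected domain.
    host = reverse(ip)
    if host is None:
        return False
    host = host.rstrip(".").lower()
    domain = expected_domain.lower()
    # Compare on a label boundary so "evilsearch.live.com" does not pass
    # as a member of "search.live.com".
    if host != domain and not host.endswith("." + domain):
        return False
    # Step 2: forward lookup; the name must resolve back to the same address.
    return ip in forward(host)


# Example call (performs real DNS queries, so the result depends on live data):
# is_verified_bot("207.46.98.149", "search.live.com")
```

Matching on a label boundary rather than a bare string suffix is a design choice of this sketch; the quoted guidance only says the name should "end with" the crawler's domain.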

The method is valid, and it works… most of the time. When does it not work? When the search engine company doesn’t have their DNS set up properly. “What’s that?” you say, “How can you accuse the likes of Google, Yahoo!, MSN, and Ask Jeeves of something like that?”

As I said above, I decided to do something to protect my site from fake bots. I made a nice little PHP script to perform the checks outlined by the search engine companies, and I put an include statement at the top of my site’s pages to run that piece of code. Bots that weren’t who they said they were received a 503 HTTP status code (service unavailable). Everyone else got the regular page.

I had set up the PHP script to keep a log of all requests that were sent the 503 status page. I tested the function using an online testing service that let me choose the user agent. Testing was also done using a user agent switcher in Firefox. The script seemed to function properly, and the results in the log file were as expected.
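For readers who want a feel for the gate-and-log logic, here is a rough Python sketch. It is not my PHP script. The names `BOT_DOMAINS`, `claimed_domain`, and `status_for` are invented for illustration, the user-agent tokens are simple substrings chosen as an assumption, and the `verify` argument stands in for whatever round-trip DNS check you use. The googlebot.com entry comes from Google's own published guidance rather than from the passages quoted above.

```python
import logging

# User-agent substring -> domain the reverse DNS name is expected to be in.
# The tokens are illustrative assumptions, not an authoritative list.
BOT_DOMAINS = {
    "googlebot": "googlebot.com",  # from Google's guidance, not quoted above
    "slurp": "crawl.yahoo.net",
    "msnbot": "search.live.com",
    "teoma": "ask.com",
}

log = logging.getLogger("botcheck")


def claimed_domain(user_agent):
    """Return the domain a user agent claims to belong to, or None."""
    ua = user_agent.lower()
    for token, domain in BOT_DOMAINS.items():
        if token in ua:
            return domain
    return None


def status_for(user_agent, ip, verify):
    """Return the HTTP status to send: 503 for a failed bot claim, else 200.

    verify(ip, domain) -> bool is a caller-supplied round-trip DNS check.
    """
    domain = claimed_domain(user_agent)
    if domain is None:
        return 200  # not claiming to be a known bot: serve the regular page
    if verify(ip, domain):
        return 200  # the claim checks out
    # Keep a record of every request that is turned away.
    log.warning("503 sent: ip=%s expected=%s ua=%r", ip, domain, user_agent)
    return 503
```

Ordinary browsers never trigger a DNS lookup in this arrangement, since the check only runs when the user agent claims to be one of the listed bots.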

When I checked the log file, I also checked the IP addresses from the command line on my local computer, just to verify that things were working properly. I noticed some strange results. A request from a user agent claiming to be MSNbot came from the IP address 65.55.233.40 on 15 September. The reverse DNS result from that address pointed to the name bl1sch2041711.phx.gbl. For those of you familiar with official TLDs, you will notice that .gbl is not a recognized top-level domain.

Just what is going on here? It appears that Microsoft, in their infinite wisdom, have the reverse DNS (PTR) records for one or more blocks of IP addresses pointing to a domain name that does not exist for the outside world. Wait a minute, didn’t MSN just say, "The name of all live search crawlers will end with ‘search.live.com’. If the name doesn’t end with ‘search.live.com’, you know it’s not really our crawler"? It seems that this doesn’t hold in the real world.

Microsoft, however, is not the only group that does not live up to their own published standards.

The following day, a hit from 202.160.178.78 was entered in the log file. It came from Yahoo! Slurp China, and its host name was in the inktomisearch.com domain, a domain which Yahoo! claims to have stopped using for Slurp.

A few days later, a bot claiming to be Ask Jeeves/Teoma came from 65.214.36.73. This address resolved to a host in the directhit.com domain. This domain is owned by Ask Jeeves, but their bot is not supposed to be coming from that domain. As far as I know, they don’t even use it at all. I used the ask.com contact form to let them know about this one, and it has now been changed to the ask.com domain. They did, however, neglect to respond to my query in any other way, like maybe a note to say “Thanks, we missed that one.”

Another hit from Ask Jeeves came from 206.80.1.253, which resolves to g2spf.jeeves.ask.info. This is a legitimate domain owned by Ask Jeeves, but it still isn’t the domain they said Teoma would be coming from.

A hit from 65.55.241.213 claimed to be MSNbot. This IP address has no reverse DNS record set at all. The IP block, however, is owned by Microsoft. The address 72.14.193.166 also has no reverse DNS record set. This IP block belongs to Google. Once again, they just aren’t playing fair.
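The log entries above fall into a few distinct kinds of failure, and it can help to record which kind occurred instead of a bare pass or fail. The following Python sketch is one hypothetical way to do that. The `classify` function and its labels are invented for illustration, and the sample table simply replays the reverse DNS results reported in this article instead of querying DNS.

```python
def classify(ip, expected_domain, reverse, forward):
    """Label the outcome of a round-trip DNS check (illustrative labels).

    reverse(ip) -> host name or None; forward(host) -> list of addresses.
    """
    host = reverse(ip)
    if not host:
        return "no-reverse-record"
    host = host.rstrip(".").lower()
    domain = expected_domain.lower()
    if host != domain and not host.endswith("." + domain):
        return "unexpected-domain"
    if ip not in forward(host):
        return "forward-mismatch"
    return "verified"


# Reverse DNS results as reported in this article (None = no record set).
OBSERVED = {
    "65.55.233.40": "bl1sch2041711.phx.gbl",
    "206.80.1.253": "g2spf.jeeves.ask.info",
    "65.55.241.213": None,
}


def no_forward(host):
    """Stand-in forward lookup; none of the sample entries reach this step."""
    return []


results = {
    "65.55.233.40": classify("65.55.233.40", "search.live.com", OBSERVED.get, no_forward),
    "206.80.1.253": classify("206.80.1.253", "ask.com", OBSERVED.get, no_forward),
    "65.55.241.213": classify("65.55.241.213", "search.live.com", OBSERVED.get, no_forward),
}
```

A label by itself does not say whether a request came from a real crawler with sloppy DNS or from an impostor. It only narrows down where to look next, such as checking who owns the IP block.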

No, we don’t live in a perfect world. It would be nice, though, if people would go just that extra step further to make sure that mistakes which are pointed out to them are resolved (as Ask Jeeves apparently did). The phx.gbl issue has been around for years, and Microsoft (as far as I know) has no intention of making any changes in this regard.

So, what do I do now? I’ve made some changes in my script as I see more results in my log files. It’s getting better, and one of these days I will publish the PHP code here for the benefit of others. It would all be so much easier if the bot owners would keep their DNS practices up to par.

Better Searching With Google

Sunday, September 23rd, 2007

For several years I have been of the opinion that the majority of people using Google’s search engine know how to use only a small fraction of its potential when looking for something online. Wading through server logs and Google Analytics reports has only reinforced this opinion.

I suppose that this article is not the type to reach the general searching public, but I find that even people who have been online for years, and who work in Internet-related fields, are unaware of some of the basic functions they can use to get better results from their searches.

I would like to give people a couple of basic tips that should help them to narrow their searches in order to find web pages that contain what they are really trying to find.

Use Quotes

When most people search using Google, they type a few words into the search box, and press the button, hoping for the best. While this can yield results containing useful pages, often these results contain countless pages that have nothing to do with what you really want to find.

Consider looking for ‘search engine optimization’. If you put those words into the search box by themselves and click the search button, you are asking Google to give you a list of any pages that contain all three of these words. Somewhere in that list might be a page discussing a search for ways to optimize the output thrust of a jet engine. Is that what you had in mind? It does contain all of the search terms.

Placing quotation marks (") around words that constitute phrases you are looking for will greatly improve the relevancy of the results returned by Google. In some cases, this is a necessity when searching for some words, particularly names, that contain spaces or hyphens. While doing genealogy research online, I must use quotes around the family name Te Kulve in order to get desired results.
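To make the difference concrete, here is a toy Python model of the two kinds of matching. It is a deliberate simplification and says nothing about how Google actually ranks or matches pages. The helper names (`words`, `has_all_words`, `has_phrase`) and the sample sentences are invented for the example.

```python
import re


def words(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has_all_words(page, query):
    """Unquoted search (toy model): every query word appears somewhere."""
    found = set(words(page))
    return all(w in found for w in words(query))


def has_phrase(page, phrase):
    """Quoted search (toy model): the words appear together, in order."""
    tokens = words(page)
    target = words(phrase)
    n = len(target)
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


jet_page = "A search for ways to reach optimization of the output thrust of a jet engine."
seo_page = "An introduction to search engine optimization."
query = "search engine optimization"

unquoted = [has_all_words(jet_page, query), has_all_words(seo_page, query)]
quoted = [has_phrase(jet_page, query), has_phrase(seo_page, query)]
```

In this model the jet engine page matches the unquoted query but not the quoted one, while the SEO page matches both.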

Plus and Minus Signs

Placing a plus (+) or minus (-) sign immediately before a word or a quoted phrase in your search query tells Google that every result must contain that word or phrase (plus) or must not contain it (minus).

Have you ever initiated a search only to find that hundreds of pages containing your search terms, but obviously unrelated to what you want to find, are included in the results? One way to get those pages out of the results is to find another term that most of them have in common that you can exclude with a refined query.

If hundreds of pages about baseball show up in your search for a quantum mechanics reference, simply add -baseball to your search query. The same technique works with other advanced search options. Are you getting hundreds of unrelated results from a particular site? Use -site:www.domain.com to exclude pages within that site from the search results.

Do some of the results in the list not contain one of the words you are searching for? Use the plus sign in front of that word to require that the results all contain that word. I frequently use a plus sign in front of each word or phrase in my search queries. Yes, it’s overkill, and most times not necessary, but old habits die hard.
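If you build search links by hand or in a script, the operators described in this article can be assembled with a small helper. The Python sketch below is only an illustration: `build_query` and `search_url` are hypothetical names, and the query syntax is simply the quote, plus, minus, and -site: forms discussed above.

```python
from urllib.parse import urlencode


def term(text):
    """Wrap a term in quotes when it contains a space or a hyphen."""
    return f'"{text}"' if " " in text or "-" in text else text


def build_query(plain=(), required=(), excluded=(), excluded_sites=()):
    """Assemble a query string from the operators described in the article."""
    parts = [term(t) for t in plain]
    parts += ["+" + term(t) for t in required]
    parts += ["-" + term(t) for t in excluded]
    parts += ["-site:" + site for site in excluded_sites]
    return " ".join(parts)


def search_url(query):
    """Return a Google search URL with the query safely URL-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_query(
    required=["quantum mechanics", "reference"],
    excluded=["baseball"],
    excluded_sites=["www.domain.com"],
)
# query == '+"quantum mechanics" +reference -baseball -site:www.domain.com'
url = search_url(query)
```

Encoding the query matters here because a literal plus sign in a URL would otherwise be read as a space.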

What does this mean for SEO?

For those of us involved in Search Engine Optimization, it can only benefit us if people learn how to search more effectively. It’s usually easier to get nearer the top of a list of fifteen thousand results than a list of seventeen million results.

It also means that we need to keep our eyes open to what people are searching for when they come to our sites via Google (or any other search engine). Look at your log file reports, or Google Analytics reports. See how people find you. Are their searches really looking for something different, or are you hitting the right targets?

Help your friends and neighbors find what they’re looking for. Give them a couple of basic tips about searching, and I’m sure they’ll see better results. They’ll thank you for it.