Tuesday, July 18, 2006

I just blew an afternoon fighting a crawler detection issue in ASP.NET 2.0.  There is a slick feature where you can define browser capabilities with .browser files.  My tests indicated that the Yahoo (Slurp) and Ask (Teoma) crawlers were not being detected as crawlers by the framework.  Simple enough, I thought.  I'll just create a .browser file and add it to App_Browsers in my app.  In the file I will do a user agent match on Slurp and Ask.
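For reference, a .browser file of the kind described here might look roughly like the sketch below. The `id` values and the choice of `parentID="Mozilla"` are assumptions for illustration, not the exact file from that afternoon, and as the rest of this post explains, this route did not get these crawlers flagged.

```xml
<!-- App_Browsers/Crawlers.browser (hypothetical file name) -->
<browsers>
  <!-- Assumed id; parentID assumes the request is first identified as Mozilla -->
  <browser id="YahooSlurp" parentID="Mozilla">
    <identification>
      <!-- match is a regular expression applied to the User-Agent header -->
      <userAgent match="Slurp" />
    </identification>
    <capabilities>
      <!-- Surfaces as Request.Browser.Crawler -->
      <capability name="crawler" value="true" />
    </capabilities>
  </browser>
  <browser id="AskTeoma" parentID="Mozilla">
    <identification>
      <userAgent match="Ask|Teoma" />
    </identification>
    <capabilities>
      <capability name="crawler" value="true" />
    </capabilities>
  </browser>
</browsers>
```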

A few hours pass by... it's not working... dive into Reflector... find a hard-coded string having to do with crawler detection.

Take a look at System.Web.Configuration.BrowserCapabilitiesFactory.CrawlerProcess(NameValueCollection headers, HttpBrowserCapabilities browserCaps).  In there is a curious little string, "crawler|Crawler|Googlebot|msnbot", being passed to a RegEx processor.

It looks like the Yahoo and Ask crawlers are detected as Mozilla browsers, but not as crawlers.  Google and MSN, on the other hand, are detected as both.  Curiously, Google and MSN are both in the little hard-coded string that is compiled into the assembly.

My theory is that the browser capabilities detection sees these popular crawlers as Mozilla browsers first and then fails to detect that they are crawlers.  Only those with a user agent string matching what is coded in the framework also get flagged as crawlers.  If anyone has more insight, please comment.
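A quick way to sanity-check this theory is to run the hard-coded pattern against a few user agent strings outside of ASP.NET. This is only a sketch: it assumes the pattern is matched directly against the User-Agent header (not confirmed here), the class and method names are made up for the demo, and the sample strings are illustrative, not pulled from real logs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CrawlerPatternDemo
{
    // The string found in CrawlerProcess via Reflector, as quoted above.
    public const string FrameworkPattern = "crawler|Crawler|Googlebot|msnbot";

    // Assumption: the framework applies the pattern to the User-Agent header.
    public static bool MatchesFrameworkPattern(string userAgent)
    {
        return userAgent != null && Regex.IsMatch(userAgent, FrameworkPattern);
    }

    public static void Main()
    {
        // Illustrative user agent strings, not captured from real traffic.
        string[] samples = new string[]
        {
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "msnbot/1.0 (+http://search.msn.com/msnbot.htm)",
            "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
            "Mozilla/2.0 (compatible; Ask Jeeves/Teoma)"
        };

        foreach (string ua in samples)
        {
            // Prints whether the hard-coded pattern matches, then the user agent.
            Console.WriteLine("{0,-5} {1}", MatchesFrameworkPattern(ua), ua);
        }
    }
}
```

With these samples, the Googlebot and msnbot strings match the pattern and the Slurp and Teoma strings do not, which is consistent with what I saw from the framework.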

So, to fix the issue I ended up implementing my own wrapper.  Here you go:

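      // Requires: using System.Web; using System.Text.RegularExpressions;
      public static bool IsCrawler(HttpApplication app)
      {
         bool isCrawler = app.Context.Request.Browser.Crawler;
         // Microsoft doesn't properly detect several crawlers (e.g. Yahoo Slurp, Ask/Teoma),
         // so fall back to checking the user agent ourselves.
         string userAgent = app.Request.UserAgent;
         // UserAgent is null when the header is missing; Regex would throw on null input.
         if (!isCrawler && userAgent != null)
         {
            // Case-insensitive substring match; note "ask" also matches inside other words.
            isCrawler = Regex.IsMatch(userAgent, "Slurp|Ask|Teoma", RegexOptions.IgnoreCase);
         }
         return isCrawler;
      }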