suggested this on April 15, 2011, 2:55 PM
I'm getting a ton of 302 notices when mozBot looks at the compare product links in our store. Google Webmaster Tools lets you define specific URL variables to ignore; is there something like that available in the SEOmoz crawler?
Thanks for all the votes! I've added this to our feature request forum so we can track it over time and so people can vote on it as a priority. We really appreciate it!
+1 and another vote for this. As in Google Webmaster Tools, it would really help to be able to define specific URL variables to ignore.
+1 on this - a gazillion errors for pages 1 through 10, 20 etc for product pages. Hard to spot the real errors. thx
Another vote for this feature ... the value of the crawl goes way down when the results are peppered with pages that shouldn't be included. In my case the url parameter to be ignored is a session identifier (that only appears for visitors without cookies).
I find myself wondering why the crawler doesn't respect the rel=canonical and treat the page (with the url parameters) as the canonical page (without the url parameters). Is the crawler's behaviour correct and my expectations about rel=canonical misguided?
I am having the same issue. Count my vote for a fix.
I'm with Jeremy, this feature would be nice. I'm getting a ton of crawl errors for "duplicate page titles" when, in fact, it's the same page but with different query parameters.
We don't have a way for you to specify those things through the Web App at the moment. However, rogerbot respects all robots.txt rules, so if you tell him to ignore those types of pages, he will!
Just so you know, the user agent used by the Web App is rogerbot. Unfortunately, the IP address is changing all of the time so we cannot specify a particular IP address. Our crawler obeys robots.txt, and in order for us to crawl your site you need to make sure your robot policy allows the user agent rogerbot. You can either have a robots.txt file that allows all bots (including rogerbot), or you can exclude some robots, but allow rogerbot.
In order to allow SEOmoz to crawl your website while blocking bots you must create a robots.txt file. (You can read more about robots.txt here: http://www.seomoz.org/knowledge/robotstxt). After you have done that, you can allow SEOmoz's crawler (rogerbot) to crawl your site with the following code within the robots.txt file:
User-agent: *
Disallow: /no-bots/block-all-bots-except-rogerbot-page.html
User-agent: rogerbot
Allow: /no-bots/block-all-bots-except-rogerbot-page.html
It is important to put the most broad rules first in the robots.txt file. In this example, we are disallowing all bots ("*") from crawling a specific HTML file and then specifically allowing rogerbot to crawl that same file. In this case, the second, more specific rule, takes precedence.
You can read even more about this here: http://www.seomoz.org/knowledge/robotstxt and http://www.seomoz.org/dp/rogerbot
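If you'd like to sanity-check rules like the ones above before deploying them, here is a minimal sketch using Python's standard-library robots.txt parser. It only shows how that parser reads the rules, not how rogerbot itself evaluates them. The domain example.com, the helper name is_allowed, and the bot name somebot are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /no-bots/block-all-bots-except-rogerbot-page.html

User-agent: rogerbot
Allow: /no-bots/block-all-bots-except-rogerbot-page.html
"""

# example.com is a placeholder domain (an assumption, not a real store).
PAGE = "http://example.com/no-bots/block-all-bots-except-rogerbot-page.html"


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the standard-library parser lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "somebot" stands in for any crawler other than rogerbot.
    for agent in ("rogerbot", "somebot"):
        print(agent, is_allowed(agent, PAGE))
```

With these rules the standard-library parser should let rogerbot fetch the page and refuse any other user agent, which matches the behaviour described above.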
As I'd mentioned, you can use these same types of rules to exclude rogerbot from certain subdirectories or pages. You could also use canonicalization to consolidate those URLs in the eyes of search engines, though they might not have a problem with these things anyway. Hope this helps!
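As a rough illustration of excluding rogerbot from a subdirectory, the sketch below blocks a made-up compare path and checks the result with Python's standard-library parser. The path /catalog/product_compare/, the domain example.com, and the helper name may_crawl are all assumptions, so substitute whatever your store actually uses. The standard-library parser matches rules by simple path prefix, so the sketch avoids wildcards; this is a local check only and does not prove how rogerbot will behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /catalog/product_compare/ is an assumed path,
# not one taken from this thread.
ROBOTS_TXT = """\
User-agent: rogerbot
Disallow: /catalog/product_compare/
"""


def may_crawl(url, user_agent="rogerbot", robots_txt=ROBOTS_TXT):
    """Return True if the standard-library parser lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain.
    urls = [
        "http://example.com/catalog/product_compare/index?items=1,2",
        "http://example.com/catalog/widget.html",
    ]
    for url in urls:
        print(url, may_crawl(url))
```

Because the rule is a path prefix, every compare URL under that directory is skipped regardless of its query string, while ordinary product pages stay crawlable.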