

Is there a way to tell the crawler to ignore specific variables in a URL (like we can in Google Webmaster Tools)?

Jeremy_FP
suggested this on April 15, 2011 02:55 PM

I'm getting a ton of 302 notices when mozBot looks at the compare product links in our store. Google Webmaster Tools lets you define specific URL variables to ignore; is there something like that available in the SEOmoz crawler?

 

Comments (latest first)

Aaron Wheeler
Moz

Thanks for all the votes! I've added this to our feature request forum so we can track it over time and so people can vote on it as a priority. We really appreciate it!

January 25, 2012 03:17 PM
dbm44

+1 and another vote for this. As in Google Webmaster Tools, it would really help to be able to define specific URL variables to ignore.

January 25, 2012 08:01 AM
Capote

+1 on this - a gazillion errors for pages 1 through 10, 20, etc. for product pages. Hard to spot the real errors. thx

January 18, 2012 02:45 PM
MadSkillz

Another vote for this feature ... the value of the crawl goes way down when the results are peppered with pages that shouldn't be included. In my case, the URL parameter to be ignored is a session identifier (which only appears for visitors without cookies).

I find myself wondering why the crawler doesn't respect rel=canonical and treat the page with the URL parameters the same as the canonical page without them. Is the crawler's behaviour correct and my expectations about rel=canonical misguided?
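For context, the kind of canonical tag I'm talking about sits on the parameterized page and points at the clean URL. The URLs below are just placeholders, not my actual site:

<!-- In the <head> of http://www.example.com/product?sessionid=abc123 -->
<link rel="canonical" href="http://www.example.com/product" />

My assumption has been that a crawler seeing this would report the clean URL rather than every session-ID variant, but maybe that's not how it actually works.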

November 03, 2011 06:37 PM
ChristianV

I am having the same issue.  Count my vote for a fix. 

August 09, 2011 12:14 PM
RealWisam

I'm with Jeremy, this feature would be nice. I'm getting a ton of crawl errors for "duplicate page titles" when, in fact, it's the same page but with different query parameters.

June 27, 2011 11:31 AM
Aaron Wheeler
Moz

Hey Jeremy!

We don't have a way for you to specify those things through the Web App at the moment. However, Rogerbot respects all robots.txt parameters, so if you tell him to ignore those types of pages, he will!
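For example, if those compare links all share a common query parameter, you could block just those URLs with a pattern like the one below. The parameter name is only a placeholder, and this assumes rogerbot honors the * wildcard the way most major crawlers do:

# Block any URL whose query string contains the "compare" parameter
User-agent: rogerbot
Disallow: /*compare=

Everything else on the site would still be crawled as usual.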

Just so you know, the user agent used by the Web App is rogerbot. Unfortunately, the IP address changes all the time, so we can't give you a single IP address to allow. Our crawler obeys robots.txt, so in order for us to crawl your site you need to make sure your robots.txt policy allows the user agent rogerbot. You can either have a robots.txt file that allows all bots (including rogerbot), or you can exclude some robots but allow rogerbot.

In order to allow SEOmoz to crawl your website while blocking other bots, you must create a robots.txt file. (You can read more about robots.txt here: http://www.seomoz.org/knowledge/robotstxt). After you have done that, you can allow SEOmoz's crawler (rogerbot) to crawl your site with the following rules in the robots.txt file:

User-agent: *
Disallow: /no-bots/block-all-bots-except-rogerbot-page.html

User-agent: rogerbot
Allow: /no-bots/block-all-bots-except-rogerbot-page.html

It is important to put the broadest rules first in the robots.txt file. In this example, we are disallowing all bots ("*") from crawling a specific HTML file and then specifically allowing rogerbot to crawl that same file. In this case, the second, more specific rule takes precedence.

You can read even more about this here: http://www.seomoz.org/knowledge/robotstxt and http://www.seomoz.org/dp/rogerbot

As I'd mentioned, you can use these same types of rules to exclude rogerbot from certain subdirectories or pages. You could also use canonicalization to consolidate those URLs in the eyes of search engines, though they might not have a problem with these things anyway. Hope this helps!
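For instance, keeping rogerbot out of an entire subdirectory would look something like this (the path below is just an example, not anything specific to your site):

# Keep rogerbot out of the compare-products section
User-agent: rogerbot
Disallow: /compare/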

April 15, 2011 04:45 PM