There are occasions when allowing search engine robots access to pages on your site can do more harm than good. Although Google and the other engines are very good at correctly detecting and disregarding duplicate content, it’s wise to pay special attention to which pages on your site might cause the problem and put measures in place to prevent it. Here are four real world examples:
Google Webmaster Tools offers a clear insight into duplicate content under the Diagnostics > HTML suggestions menu. It shows duplicate meta description and title data, including the pages in question. If you’re not signed up to Google Webmaster Tools (you should be) you can try searching for a unique piece of content in quotes. For example, say you’re concerned about an ‘email a friend’ page. First, go to the page in question and copy some of the text from it. It should be text that always appears on that page and does not appear on any other page on your site. Next, go to Google and type in:
site:yourdomain.com "the copied text"
The ‘site:’ is very important! So, if your site was www.example.com and the text you copied was ‘inform your friends and colleagues about any information’ then you would do a Google search for:
site:www.example.com "inform your friends and colleagues about any information"
If Google returns more than one result then you have a duplicate content issue.
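The search described above can also be assembled in code, which is handy if you want to check several pages. This is a minimal Python sketch, not an official tool: the helper names `build_site_query` and `build_search_url` are made up for illustration, and the search URL assumes Google's usual `/search?q=` form. It only builds the query; it does not contact Google.

```python
from urllib.parse import urlencode


def build_site_query(domain: str, snippet: str) -> str:
    """Return a query of the form: site:domain "snippet" (hypothetical helper)."""
    # Remove double quotes from the snippet so it stays one quoted phrase.
    cleaned = snippet.replace('"', "").strip()
    return f'site:{domain} "{cleaned}"'


def build_search_url(domain: str, snippet: str) -> str:
    """Return a search URL for the query (assumes the /search?q= form)."""
    return "https://www.google.com/search?" + urlencode(
        {"q": build_site_query(domain, snippet)}
    )


query = build_site_query(
    "www.example.com",
    "inform your friends and colleagues about any information",
)
print(query)
print(build_search_url("www.example.com", "inform your friends"))
```

Paste the printed query into Google, or open the printed URL in a browser, and count the results by hand.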
Google state that duplicate content can be seen as an attempt to manipulate the engine and can result in a penalty or removal from the index. They also state that if a site has duplicate content issues they may crawl it less often to avoid wasting the resources of their crawlers. In other words, it’s potentially a very big risk indeed. The major engines should be clever enough to figure out whether you’re trying to do something bad, so the risk of being penalised should be low but, when it comes to how your site ranks, no risk is considerably better than low risk!
There are three options available to webmasters to help prevent search engines from indexing pages you don’t want them to: robots.txt, the noindex meta tag and the canonical link element.
Robots.txt is of limited use. It’s designed to prevent robots from accessing certain pages but doesn’t perform exactly as expected. Google states that “[we] won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web.” We’ve found that even if the pages aren’t linked to externally Google will still index page URLs and titles. So, to prevent duplicate content, don’t rely on a robots.txt directive.
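To see what a robots.txt rule actually controls, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt rules and the `products.htm` URL are invented for illustration. The parser answers one question only, whether a robot may fetch a URL; it says nothing about whether the URL can still appear in an index, which is the limitation described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /email-a-friend.htm
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A Disallow rule matches by path prefix, so URL variants are covered too.
blocked = parser.can_fetch(
    "Googlebot", "http://www.example.com/email-a-friend.htm?page=1735492"
)
allowed = parser.can_fetch("Googlebot", "http://www.example.com/products.htm")

print("email-a-friend may be crawled:", blocked)
print("products may be crawled:", allowed)
```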
The second option is the noindex meta tag which, fortunately, stops Google and the other engines in their tracks and results in the page being completely removed from their index. The form it should take is:
<meta name="robots" content="noindex, nofollow" />
The nofollow is optional but, assuming the page in question is not the sole gateway to other areas of your site, it reinforces the fact that the page is of no use to the engine.
Using the noindex tag we removed hundreds of duplicate content pages from a client site within days. On the face of things they weren’t causing any damage but they were listed as duplicates in Google Webmaster Tools and that was incentive enough to act.
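If you are adding the tag to many pages, it helps to confirm it is really present in the served HTML. The following is a rough Python sketch using the standard library's `html.parser`; the class name `RobotsMetaParser` and the function `robots_directives` are hypothetical names chosen for this example, and the sample markup is made up.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow" />'
    "</head><body>Email a friend</body></html>"
)
found = robots_directives(page)
print("noindex present:", "noindex" in found)
```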
The major search engines all support an additional option to help prevent duplicate content: the canonical link element. It’s designed specifically for the task and is appropriate if, unlike with robots.txt or the noindex tag, you still want a single copy of the page included in a search engine’s index. It’s a simple tag that, like a meta tag, is inserted in the head of the page in question. Taking the ‘email a friend’ example above, the address of the page might look like this:
http://www.example.com/email-a-friend.htm?page=1735492
By inserting the canonical link element in the head of the page like this…
<link rel="canonical" href="http://www.example.com/email-a-friend.htm" />
…we tell the search engines to ignore the ‘?page=1735492’ portion of the URL and treat the remainder as the preferred location. Doing this allows the ‘email a friend’ page to be safely indexed by the search engines without them thinking every variant of ‘?page=1735492’ denotes a separate page.
So, if you have any pages on your site that are linked to from multiple places, especially if they contain URL variables, include the noindex meta tag or canonical link element in the head to prevent any possible problems with indexing or search result penalties.
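On a dynamic site the element is usually generated rather than typed by hand. Here is a minimal Python sketch of one way to do that, assuming the preferred address is simply the URL without its query string, as in the example above. The function names `canonical_url` and `canonical_link_element` are hypothetical. If some of your URL variables genuinely change the page content, this blanket approach would not be appropriate.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Drop the query string and fragment; keep scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link_element(url):
    """Build the canonical link element for the head of the page."""
    # Escape the URL so it is safe inside an HTML attribute.
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}" />'


tag = canonical_link_element(
    "http://www.example.com/email-a-friend.htm?page=1735492"
)
print(tag)
```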
2 Comments
One word for you: canonicalization
And what a choice word it is! I'd only ever associated canonicalisation with www/non-www redirects but you're absolutely right of course, it applies equally to individual pages and is supported by all major search engines. Thanks for your short but sweet comment; I've updated the post accordingly.