First of all, for those who are just getting their feet wet: what is duplicate content? Duplicate content is just that - text that is the same, word for word, on different pages. It is mostly found on different websites, since it rarely makes sense to duplicate your own pages on purpose. We differentiate between external duplicate content (the same content on different websites) and internal duplicate content (the same content within one website). The bottom line is that you are not penalized for duplicate content unless your entire website was built for that purpose. This comes straight from the horse's mouth - Adam Lasnik, Search Evangelist / Webmaster Communications expert at Google. Why? Read on to find out.
External duplicate content raises more flags than does internal duplicate content, because it is located on two or more different websites, and therefore the assumption is made that the original was copied by a different person or entity. With internal pages, duplicate content is primarily unintentional, and the assumption is made that the writer of the content is one and the same.
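To make the internal/external distinction concrete, here is a minimal sketch that compares two pages word for word (after normalizing whitespace and case) and labels a match by whether both URLs share a host. The function names, the example URLs, and the hashing approach are illustrative assumptions; this is not a description of how any search engine actually detects duplicates.

```python
import hashlib
import re
from urllib.parse import urlparse


def fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify_duplicate(url_a, text_a, url_b, text_b):
    """Return 'internal', 'external', or 'not duplicate' for two pages."""
    if fingerprint(text_a) != fingerprint(text_b):
        return "not duplicate"
    # Same host means the copy lives on the same website (internal).
    same_site = urlparse(url_a).netloc == urlparse(url_b).netloc
    return "internal" if same_site else "external"


if __name__ == "__main__":
    # Hypothetical pages used only for illustration.
    print(classify_duplicate(
        "http://example.com/node/1", "Content is king",
        "http://example.com/friendly-page-title", "Content is king",
    ))
```

An exact-match fingerprint like this only catches word-for-word copies, which is the definition used in this article; lightly reworded copies would need a fuzzier comparison.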
It's not uncommon for pages on a CMS (Content Management System) to be duplicated inadvertently. The system that this website was built with (Drupal), for example, by default names pages /node#, where # is substituted with a page number. Because this is not very search-engine friendly (we recommend using keywords in your page path - check out our article on SEO 101 and read the section entitled "SEF (Search Engine Friendly) URL's/ Filenames"), we implement Drupal's "clean URL" mechanism, which automatically generates filenames based on keywords in the title.
The problem is that at this point, two versions of each page exist - the /node# version and the /friendly-page-title version. By default, search engines will index both pages and treat them as different entities. However, when they detect that one page has exactly the same content as the other, they are likely to disregard one of the pages. For this reason, we block all the /node pages via a robots.txt file located in the top-level public directory (i.e. public_html or www), using the following statement:
User-agent: *
Disallow: /node
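Before relying on a rule like this, it can help to check which paths it actually covers. The sketch below feeds the two lines above to Python's standard urllib.robotparser and tests a handful of paths. The host example.com and the sample paths are assumptions chosen for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same two lines shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /node
"""


def is_crawlable(path, robots_txt=ROBOTS_TXT, user_agent="*"):
    """Return True if the given path is allowed by the robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # example.com is a placeholder host; only the path matters here.
    return parser.can_fetch(user_agent, "http://example.com" + path)


if __name__ == "__main__":
    sample_paths = [
        "/",                     # front page
        "/friendly-page-title",  # keyword-based version of a page
        "/node",                 # default listing
        "/node/12",              # a default numbered page
        "/node?page=2",          # paged listing
        "/node-based-design",    # hypothetical friendly URL starting with "node"
    ]
    for path in sample_paths:
        status = "crawlable" if is_crawlable(path) else "blocked"
        print(f"{path}: {status}")
```

Note that Disallow rules match by prefix, so a friendly URL whose path happens to begin with /node (such as the hypothetical /node-based-design) is blocked as well. If that is a concern, a narrower rule may be a better fit.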
In most cases, duplicate content is not penalized. Why? Because of content syndication. It can be useful for important content to be found in more than one place, and in more than one variation. Just like with newspapers and magazines, the more sources an item has, the easier it is for the consumer to find.
How does a search engine determine if your content is the original, or first, version, and why is that important? The first indexed version of an article is typically the one that receives the most ranking points. That's why it's important that if you publish content you notify the search engines (in particular the big three - Google, Yahoo, and MSN) before someone else does. You can do this via their Webmaster Sitemap tools. Read more about those in our article on sitemaps: roadmaps for visitors and search engines.
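As a rough illustration of what you would submit through those tools, the following sketch builds a small sitemap file in the sitemaps.org XML format using Python's standard library. The function name, the URLs, and the dates are made up for the example; it shows the general shape of a sitemap, not the output of any particular Drupal module.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified) pairs.

    last_modified is a date string in YYYY-MM-DD form.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages and dates, for illustration only.
    print(build_sitemap([
        ("http://example.com/", "2010-01-15"),
        ("http://example.com/friendly-page-title", "2010-01-20"),
    ]))
```

Listing only the friendly version of each URL in the sitemap keeps it consistent with the robots.txt rule that blocks the /node versions.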
The only time you would receive a duplicate content penalty is if you were trying to falsely promote the work as your own. While search engines typically won't pick up on such an incident, if your entire website serves this purpose you'll most likely be discovered and penalized. Remember that content is king - it's vital that your website contain primarily original and fresh content that is regularly updated.
All Content © 2007 - 2010 Contract Web Development, Inc. All Rights Reserved.
What do you think?
Post number 089 on www.guruofsearch.com
Greetings, amazing! Not clear for me, how often are you updating your www.guruofsearch.com?
Thanks
Edwas
Not as often as we'd like
Not as often as we'd like, although we don't need to be in order to catch spam
Google is the best!!
Google is the best SE! Go Google!
Technical problem...
Each time I go to respond to a post, I see a PHP error...
Do you need to get a minimum amount of posts to reply in a posting?
No minimum post limit
No, you simply need to refrain from spamming
Good article
Helpful article for me, thanks a lot.
p.s. Your rating thing is broken, I have no idea which is high/low
Updated rating system
Hi there,
We've updated our rating system to display the total number of stars (i.e. 1/5 being the worst and 5/5 being the best). You can rate each of our articles by selecting your rating below the article (and above the comments section).
Google will not penalize for duplicate content
Google will not penalize you for duplicate content; rather, your content won't be indexed as unique content. There's no point in Google indexing the same piece of information over and over again, is there? In a grocery store, when you take a product off the shelf, another one needs to be on hand for the next customer. Online, one copy of the content is accessible to everyone.
Content is duplicated all the time, and in legitimate fashion. That's how news and press releases get syndicated. The one thing to remember if you do decide to duplicate content, or a portion of it (and this applies to any literary work - think plagiarism), is this: make sure you specify the original author and the source URL of the original document.
So how do you ensure you're getting credit for original content? The only surefire way is to make sure that Google indexes your document first (install a sitemap and use Webmaster Tools to communicate this information to Google). Beyond that, the more prominent the back links to your content are, the more reputable it will be - think of them as votes for the legitimacy and value of your document.
So this is how it works in a nutshell. Independent of everything else, the first source Google can find gets credit. If there are multiple unique sources for the same topic, then a site's overall ranking, as well as the relevance and prominence of back links to your content, are taken into account.
How to prevent Drupal front pages from being blocked?
Hello,
Great article - this answered a lot of my questions about duplicate content and how it is treated by search engines. Two questions using Drupal:
1) Since the front page has its own name, will it be indexed if I block /node in robots.txt?
2) If I block /node in robots.txt, does that mean only the front page will be indexed, but not subsequent /node pages (i.e. archived teaser pages)?
Thanks!
Drupal nodes and duplicate content
Hey there,
Great questions. I'm not sure I fully understand the first one, but here's a go at them:
1) The handling of your front page in Drupal depends on whether you have teasers set up (i.e. a front page with a bunch of teasers that link to full pages). If you do, then the front page is only accessible via example.com and example.com/node. If you have /node blocked, search engines will only index example.com and you won't get hit for duplicate content on example.com/node. If you don't have teasers set up, your front page will be located at example.com and at example.com/title. If this is the case, you may want to block the /title version in robots.txt so there's no duplicate. I don't think you'll get penalized, but you want to make sure Google indexes your root page and not the /title page.
2) Yes, if you block /node, then the front page (in teaser mode) will get indexed. However, subsequent pages (i.e. /node?page=2) will not.