The Montoya Herald, a weblog about Blueprint, jQuery, design, music and life, publishing on the web since September 2005. Written by Christian Montoya: developer, designer and entrepreneur.

The Montoya Herald — ChristianMontoya.com


Yahoo! robots-nocontent class is a bad idea

Posted on May 3.

I just read about a new effort by Yahoo! to support a class, "robots-nocontent", which website publishers can use to mark webpage content that they do not want the Yahoo! search engine to read. The idea is that webpages contain both important content (like an article) and less important content (like a navigation list), and the two have different value to search engines. By "hiding" on-page content by wrapping it in this class, a web publisher can ensure that the Yahoo! crawler pays attention only to the most important content on the page. The whole thing is explained in a positive light over at Search Engine Land, but I would like to explain why this idea, and moreover Yahoo!'s specific implementation, is a very bad idea.
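To make the mechanism concrete, here is a minimal sketch in Python of how a crawler might honor the class. It is an illustration only, not Yahoo!'s actual implementation: the `NoContentFilter` class, the `visible_text` helper and the sample page are all hypothetical names invented for this example, and the sketch assumes well-formed markup in which every non-void tag is explicitly closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NoContentFilter(HTMLParser):
    """Hypothetical filter: collects text outside robots-nocontent elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a flagged element
        self.chunks = []     # text the crawler would still consider

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start skipping at a flagged element; keep counting nested tags
        # so we know when the flagged element itself is closed.
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the text left over once flagged elements are ignored."""
    parser = NoContentFilter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Assumed sample page: an article, a flagged navigation list, a closing note.
page = """
<div id="article"><p>The article text that matters.</p></div>
<ul class="nav robots-nocontent">
  <li><a href="/">Home</a></li>
  <li><a href="/about">About</a></li>
</ul>
<p>A closing paragraph.</p>
"""

print(visible_text(page))
# prints: The article text that matters. A closing paragraph.
```

Note that the class can sit alongside other class names, as with "nav" in the sample, which is part of why a publisher could sprinkle it almost anywhere on a page.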

Semantics is something that I have been studying for a while now, especially over the past semester. A lot of people are working on ways to make the web more semantic, coming from different angles with different ideas that still need a lot of work. The road to a more semantic web is a long one, and there's a chance that nothing will be solved any time soon. There's also a chance that the web will never be truly semantic. This is why microformats started: to add semantic value to web markup in its current form, with some general consensus on the how and without any interference with current standards. I don't think I need to explain why microformats are a good idea; if you don't know it for yourself, just take my word for it. The important thing to understand is that the microformats website publishes specific ways to make things more semantic for specific purposes, and these ways become a sort of standard by being openly published and generally agreed upon.

One of the problems with Yahoo!'s implementation is that it doesn't actually follow the microformats draft which addresses the exact same purpose. I hate to pick on Yahoo! because I know that other companies do the same thing, but Yahoo! is getting the flames right now. Yahoo! should not be pushing a unique method specific to their own search engine when there is an open and somewhat standardized method already out there. This is the same thing Microsoft, Netscape, and Google have done in the past, and all of these web companies need to stop doing it. Why do I insist on this? It's simple: if every company proposed their own technique specific to their own crawler for serving this one purpose, web publishers would have to implement a handful of different techniques just to cater to each search engine. In the end it's more work for web publishers and less value for end users… we're talking about going back to the days of Netscape vs. IE, when each browser supported very different, non-standard markup that caused web publishers to write a lot of workarounds and hacks to support both browsers. The web is a place for standards where publishers should be able to publish once and leave the work up to the crawlers; one should not have to cater to specific search engines.

Beyond this, adding semantic information to a page just to define which content is search-engine-relevant is a bad idea in itself. Up until now, the general idea behind web crawlers has been that web publishers can publish their pages as all user agents will see them, and the crawlers are tasked with determining the semantics of the content. There should not be any need for specific classes that sort of "wave flags" to the crawlers to tell them where to look and where not to look. Even with the limited semantic capabilities we have with current markup standards, it is not hard to convey to user agents which parts of the page make up the main content and which parts make up navigation lists or other unrelated content. Search engine companies should follow existing standards and encourage web publishers to use those standards, not their own exclusive, proprietary techniques.

Now I know very well that this specific technique could take off and get support from all the search engines, much like "nofollow" did in the past, and become its own little convention that everyone ends up using. I should, however, mention that the reasons some people give for why this is a good idea are a bit beyond me. Apparently this technique is supposed to help stop web spam… does that make sense to anyone? If this offers a way to "hide" page content from search engines, doesn't it mean that you can put your web spam content (such as links meant to entice dumb users) within this class and have search engines completely ignore it, essentially making them oblivious to the fact that they are indexing spammy pages? Maybe you can foresee this:

<html>
...
<p>This is fake but realistic content.</p>

<p class="robots-nocontent">
  <a href="http://spam.com/">
    This is a fake link that would cause
    this page to be penalized by Yahoo!
    if it weren't cloaked.
  </a>
</p>
...
</html>

If anything, we're talking about this new technique being used to Yahoo!'s disadvantage. That doesn't seem like something to celebrate.

So, in short: this is not the first time a web company has done something like this, I think it's a mistake, companies like Yahoo! need to stop doing this and support open standards, and semantics shouldn't require this type of explicit flag-waving. Your thoughts?

4 Comments

  1. John Bokma on May 3, 2007

    Can't agree more. Needing to mark up which parts are content and which parts are relevant to search engines is a bad idea.

  2. Alexander Kinnunen on May 3, 2007

    Excellent article / blog post!
    I agree wholeheartetly (?!) with almost all your points, thanks for taking your time to read this! :)

  3. Jeroen Mulder on May 3, 2007

    I won't be the one that encourages the implementation of this proposal, but your point on its purpose is unfair.

    The bigger picture is that there is a clear desire to define page content separately from supporting content (such as navigation, footers, sidebars etc.). I have this desire as well and Yahoo! is simply trying to solve this. Without backing by a search engine, it would never gain traction. As you mentioned yourself, see the nofollow case.

    And there's no way it can be used against Yahoo! See the article at SearchEngineLand. Of course they will still index your document and use all content to determine ranking, spam and whatnot. What they will not do is use the flagged content to determine whether or not the document is a match for the given search. So, I really see no problem for Yahoo!

    Now, if you'd ask me whether it should actually be used by content authors, then I'd agree it is a bad idea. I am all for transparency and that includes having all content available to be matched.

  4. Christian Montoya on May 3, 2007

    Jeroen: Let's remember that Yahoo! is not using the microformats draft here; instead of following an open convention, they insisted on making up their own method. If they had followed the microformats method, I wouldn't have criticized them so much.

    As for the issue of spam, it can still be used to Yahoo!'s disadvantage. All of the search engine crawlers have a hard time detecting spam as it is, and now that there is a way to mask page content that is irrelevant to searches, it's very easy to boost a page and make it look like it is completely relevant to a search. If Yahoo! were at all capable of identifying spam already, it wouldn't be such a big deal, but they can't and this only makes things worse.
