
A site map is an excellent way to make sure that it is your content that makes it into Google rather than a copy from someone who 'clones' you.

It can give you a few days' lead over the cloners, and that's sometimes all you need.
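
For concreteness, the format itself is trivial: a sitemaps.org-style XML file listing your URLs, submitted through Google Webmaster Tools. A minimal sketch in Python (the URLs and output path are placeholders, not from the project discussed here):

    # Minimal sketch: emit a sitemaps.org-format sitemap.xml.
    # The URL list and lastmod dates below are placeholders.
    from datetime import date
    from xml.sax.saxutils import escape

    urls = [
        ("https://example.com/", date.today()),
        ("https://example.com/articles/1", date.today()),
    ]

    with open("sitemap.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for loc, lastmod in urls:
            f.write("  <url>\n")
            f.write(f"    <loc>{escape(loc)}</loc>\n")
            f.write(f"    <lastmod>{lastmod.isoformat()}</lastmod>\n")
            f.write("  </url>\n")
        f.write("</urlset>\n")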



I call bullshit. Have you ever actually had someone clone your site, or tried to protect against it this way? I am all ears for your story... I have never heard of anything like that.

Setting up a sitemap is usually just a placebo for finicky webmasters, and it's a waste of time to boot. The biggest sin at a start-up is wasting time, and this easily qualifies.

The only time you want a sitemap is when you don't have a text front-end, and even then, you are way better off mimicking your JavaScript with a text layer for the search engines than implementing some sitemap.
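
(A "text layer" here means serving plain, crawlable HTML to search engines while humans get the JavaScript app. A rough sketch of the idea; the crawler tokens and render functions are illustrative stubs, and search engines expect the two versions to show the same content:)

    # Sketch: plain HTML for crawlers, the JavaScript app for everyone
    # else. Tokens and render functions are illustrative stubs only.
    CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")

    def is_crawler(user_agent: str) -> bool:
        ua = user_agent.lower()
        return any(token in ua for token in CRAWLER_TOKENS)

    def render_text_page(path: str) -> str:
        # Plain HTML with real <a href> links a crawler can follow.
        return f"<html><body><h1>{path}</h1><a href='/next'>next</a></body></html>"

    def render_js_app(path: str) -> str:
        return "<html><body><div id='app'></div><script src='/app.js'></script></body></html>"

    def handle_request(path: str, user_agent: str) -> str:
        return render_text_page(path) if is_crawler(user_agent) else render_js_app(path)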

Even submitting a page in a sitemap doesn't ensure it gets crawled, and it certainly won't get crawled if you have a large site and don't set up your sitemap just right.


> I call bullshit.

That's cool with me.

> Have you actually ever had someone clone your site or tried to protect against it this way?

Yes, otherwise I wouldn't have written that, now would I?

> I am all ears for your story... I have never heard of anything like that.

That you haven't heard of something is not a reason to assume that it doesn't exist; there are actually more things you probably haven't heard about that aren't bullshit either.

In fact, there are probably lots of those things.

I recently did a fairly large project, well documented here on HN and in the press, where I did just that. Some less-than-honorable characters were cloning the data as fast as it hit the server and turning it into 'MFA' (made-for-AdSense) fodder, so at some point I shut that down and submitted the site to Google, which then crawled it at its leisure.

Even today that crawl isn't 100% complete yet, and the only way you can reach those pages right now is by going through the Google index.

Long term I expect plenty of those pages to be linked again, both internally, by linking related pages, and externally, by people who link to their content.

I get about 50 emails daily confirming that the strategy worked: those people have lost their content and find it again through Google, and this is the only copy of it on the web, in spite of active attempts at making clones.

The IP blocklist for that site is close to 10,000 addresses.
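
(At that scale a plain in-memory set is still the right tool; lookups stay O(1). A sketch, assuming a one-address-per-line blocklist.txt:)

    # Sketch of an in-memory IP blocklist check. Assumes a
    # blocklist.txt with one address per line.
    from ipaddress import ip_address

    def load_blocklist(path: str = "blocklist.txt") -> set:
        with open(path, encoding="utf-8") as f:
            return {ip_address(line.strip()) for line in f if line.strip()}

    BLOCKED = load_blocklist()

    def is_blocked(remote_addr: str) -> bool:
        return ip_address(remote_addr) in BLOCKED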

> Even submitting a page in a sitemap doesn't ensure it gets crawled, and it certainly won't get crawled if you have a large site and don't set up your sitemap just right.

No, if you don't set it up right then of course it won't work, but that goes for most things in the technology world.
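
"Just right" at this size mostly means respecting the protocol's limits: a sitemaps.org file caps out at 50,000 URLs, so a site with millions of pages needs a sitemap index pointing at many child maps. A sketch (file names and base URL are placeholders):

    # Sketch: split a large URL list into 50,000-URL child sitemaps
    # plus a sitemap index, per the sitemaps.org protocol.
    from xml.sax.saxutils import escape

    MAX_URLS = 50_000  # per-file limit in the sitemaps.org protocol

    def write_sitemaps(urls, base="https://example.com"):
        chunks = [urls[i:i + MAX_URLS] for i in range(0, len(urls), MAX_URLS)]
        for n, chunk in enumerate(chunks, 1):
            with open(f"sitemap-{n}.xml", "w", encoding="utf-8") as f:
                f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
                f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
                for loc in chunk:
                    f.write(f"  <url><loc>{escape(loc)}</loc></url>\n")
                f.write("</urlset>\n")
        with open("sitemap-index.xml", "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for n in range(1, len(chunks) + 1):
                f.write(f"  <sitemap><loc>{base}/sitemap-{n}.xml</loc></sitemap>\n")
            f.write("</sitemapindex>\n")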

Right now there are 197,000 pages of that site indexed according to Google, so I really can't complain; it seems to have worked very well.


Ok, well maybe bullshit is a strong term, so I apologize.

Still, it seems really odd to be using a sitemap like that. That's certainly not the intended purpose, and if that were a reasonable defense, what's to stop the spammers from employing it as an offense?

As far as I can tell, you use a sitemap if your site doesn't map in the normal way. Even then, it's not nearly as effective as a text site with any in-links; so much so that I wouldn't hinge any SEO strategy on a sitemap, and I wouldn't recommend that any new start-up founder set one up. Even if your observations on this are correct, you are a very special case.

It's also worth pointing out that it seems your site map was no defense until you took other measures, which wasn't clear from your original post:

"so at some point I shut that down and submitted the site to google"

Responding to some of your points:

"I get about 50 emails daily confirming that the strategy worked..."

* This confirms that your site is indexed, but it doesn't confirm your strategy beat the spammers. It may be more a matter of Google's algorithms deciding you are best, no? I can't say for sure, obviously, but you might have fared better if you hadn't done this at all.

"Even today that crawl isn't 100% complete yet"

* Part of the reason for this is that your pages have a low crawl priority, because they were submitted via a site map.


> Ok, well maybe bullshit is a strong term, so I apologize.

No problem.

> Still, it seems really odd to be using a sitemap like that. That's certainly not the intended purpose,

Agreed, but that's what we're hackers for, right?

> and if that were a reasonable defense, what's to stop the spammers from employing it as an offense?

They didn't have the URL, but the Googlebot did.

> As far as I can tell, you use a sitemap if your site doesn't map in the normal way.

Or if your site has crappy navigation, or if you want to comply with some countries' accessibility rules.

> Even then, it's not nearly as effective as a text site with any in-links - so much so that I wouldn't hinge any SEO strategy on a sitemap, and I wouldn't recommend setting one up to any new start-up founder.

Agreed, but we're not all just start-up founders, and even if we are, we're not all above using the occasional trick to get an edge.

> Even if your observations on this are correct, then you are a very special case.

That's possible.

> It's also worth pointing out that it seems your site map was no defense until you took other measures, which wasn't clear from your original post:

"so at some point I shut that down and submitted the site to google"

The thing I shut down was the publicly accessible version of the map, so only Google had access to the real thing.
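
An unguessable URL is thin protection on its own, so the sturdier version of the trick is to verify that the requester really is the Googlebot before serving the map. Google's documented check is a reverse DNS lookup followed by a forward confirmation; a sketch using only the standard library:

    # Sketch of Google's documented crawler verification: reverse DNS
    # on the requesting IP, check the hostname's domain, then confirm
    # with a forward lookup that resolves back to the same IP.
    import socket

    def is_googlebot(remote_ip: str) -> bool:
        try:
            host = socket.gethostbyaddr(remote_ip)[0]
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            return remote_ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False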

> Responding to some of your points:

"I get about 50 emails daily confirming that the strategy worked..."

> * This confirms that your site is indexed, but it doesn't confirm your strategy beat the spammers. It may be more a matter of Google's algorithms deciding you are best, no? I can't say for sure, obviously, but you might have fared better if you hadn't done this at all.

Google keeps something called a 'quad' list, if I'm not mistaken, which contains a series of word IDs in sets of four for every page that exists in their index. If certain sets of 'quads' are unique to your page, they are used to judge your page as the original for that bit of content. I gather that quite a few of the 'random keyword spam pages' are built on that predicate. I'm not sure if that is outdated or even plainly wrong, but it would explain the pattern of clone sites sometimes ranking higher than the originals simply because they got crawled earlier.
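
Whatever the internal terminology, the published version of this technique is shingling: break each page into overlapping word n-grams, compare the sets, and treat near-identical sets as duplicates, with the earlier-crawled copy having the stronger claim to being the original. A toy sketch with four-word shingles (the sample texts and any cutoff are made up):

    # Toy sketch: duplicate detection via four-word shingles ("quads"
    # in the telling above) and Jaccard similarity.
    def shingles(text: str, n: int = 4) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a: set, b: set) -> float:
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    original = "the quick brown fox jumps over the lazy dog near the river"
    clone = "the quick brown fox jumps over the lazy dog near the bank"

    print(jaccard(shingles(original), shingles(clone)))  # high score -> likely a clone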

> "Even today that crawl isn't 100% complete yet"

> * Part of the reason for this is that your pages have a low crawl priority, because they were submitted via a site map.

That's quite possible, but I'm not in a hurry. The project was huge, and if it takes a year to get it indexed, I'm perfectly content with that.

The 197,000 pages result in about 7,000 unique visitors daily; the total number of pages is in the millions.


Ok, well, you win again.



