A small thread on Hacker News yesterday claimed that Twitter had de-indexed their entire site by using robots.txt to block crawler access.
They link to the file http://www.twitter.com/robots.txt which appears as below:
OK, so that looks like they are de-indexing the site?
As you can perfectly well see, they ARE disallowing robots from crawling www.twitter.com, but that is absolutely by design and not by accident.
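To see how a crawler reads a rule like this, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The two-line file below is an assumed stand-in for a blanket block, not a copy of Twitter's actual robots.txt, and the `ExampleBot` user agent and `is_crawlable` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed stand-in for a "block everything" robots.txt (not Twitter's real file).
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical account URL on the www. host.
    print(is_crawlable(BLOCK_ALL, "ExampleBot", "http://www.twitter.com/some_account"))
```

Keep in mind that a robots.txt file only applies to the host it is served from, so a blanket block on the www. host says nothing about the same paths on the root domain.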
If you check the robots.txt file for the ROOT domain, you’ll see it is quite a bit more in depth:
So why the Difference?
It's actually really simple: one of the basic elements of SEO is preventing duplicate content. Typically SEOs recommend having only one version of your site, either on www.yoursite.com or just yoursite.com, but NOT both.
Twitter have decided to prevent indexation by blocking crawling of the www. subdomain, rather than rewriting their URLs to drop the www. and redirecting the traffic.
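For comparison, the redirect route could look something like the sketch below. It is only an illustration of the idea, not how Twitter or any particular server implements it: the `canonical_redirect` function and the `yoursite.com` default are hypothetical, and in practice this logic usually lives in the web server or load balancer configuration.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_redirect(url: str, canonical_host: str = "yoursite.com"):
    """Return (301, target) if url is on the www. variant of canonical_host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host != "www." + canonical_host:
        return None
    # Keep scheme, path, query and fragment; swap only the host.
    # Simplification: any port or userinfo in the original URL is dropped.
    target = urlunsplit(
        (parts.scheme, canonical_host, parts.path, parts.query, parts.fragment)
    )
    return 301, target


if __name__ == "__main__":
    print(canonical_redirect("http://www.yoursite.com/some_account"))
    print(canonical_redirect("http://yoursite.com/some_account"))
```

A crawler can only see a redirect like this if it is allowed to request the www. URL in the first place, which is the trade-off with blocking that host in robots.txt.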
You can see their indexation without the www. here:
vs. the indexation for the www.twitter.com subdomain specifically:
Notice the two differences:
Firstly, there are 1.3 billion pages in the index for the primary domain, and just 8 million for the incorrect version with “www”.
Secondly, none of the results in the second version has a meta description. That's because Google cannot crawl the pages, so it doesn't know how to describe them.
Then why are they in the index?
There’s a pretty simple explanation for that as well: other web pages link to Twitter accounts, often incorrectly INCLUDING the www. in the URL.
That’s why Google knows about the page, so it appears in a site: search, but it’s not in the index, nor does Google have a description for it.
So can we stop losing our Minds?
4 thoughts on “Twitter Block Web Crawlers via Robots.txt?”
Of course all the www. results are in the Google index, and they are shown for normal queries. Try “uottawageegee” for example. You’ll find a http://www.twitter.com result with its description saying crawling access is blocked by robots.txt.
You explained correctly that robots.txt regulates crawling. Crawling http://www.twitter.com is prohibited. But controlling indexation with robots.txt hasn’t worked for years. If you want to make sure these results don’t get indexed:
Make them accessible via robots.txt and use a meta robots tag or an X-Robots-Tag header with “noindex, follow”.
But this has downsides: bots will crawl all these pages. If you have a LOT of URLs, that is not something you want. First: this can get heavy on server performance. Second: it can lead to a lot of “bad” / duplicate / useless or contentless results.
So: Twitter doesn’t stop Google from indexing via robots.txt, it only stops them from crawling these URLs. There might be a strategy behind this decision, but it’s also completely legit to assume they missed the last 5 years of indexing strategies.
Ah, the allow escaped fragment parameter. Lovely giveaway that they’re prerendering their front end for search engines rather than hoping Google will know how to handle all that JS…
I LOVE having awesome SEO buddies comment 🙂
Great pickup, but rather incongruous with the failure to correctly redirect the dub dub version though, right?
I wrote a couple of posts about how Twitter used to suck at SEO
I’m not sure blocking the www version is a great move, though. Sure, they are limiting crawler requests, but they are also not consolidating their link equity to the non-www version, since spiders can’t request the pages and find out about the 301s.
(and coming back to dupes, they are still really bad at)…