Hacking Analytics, tracking iOS, Spiders & Bots

A few weeks ago I was having a conversation with a partner of ours, and the topic of tracking spider activity on fresh content came up.

While the company in question has good in-house dev resource, asking them to build something bespoke to track Googlebot was off the table. Let’s face it, most companies have dev queues, not a surplus of developers, and things like this are often de-prioritised out of existence!

This reminded me, though, of a post from a few years back that Ani Lopez (analytics guru and all-round nice guy) tweeted me, detailing a clever hack to track search engine spiders.

Analytics 101 – How it works

As you are probably aware, Google Analytics, much like any JavaScript stats package, relies on JS to function.

Without it, the visit is never recorded, because the details are simply not forwarded to Google.

It also needs to be able to set a cookie in the visiting browser, so it can recognise repeat visitors and so on.

The problem

Unfortunately, there are plenty of devices you would probably want to track that simply don’t execute JS or store cookies. The best historical example is feature phones and WAP phones (everyone remembers them, right, from before we all had iPhones or Galaxys?).

Guess what: search engine spiders fall into exactly that category as well!

Google very kindly provides a workaround for WAP phone tracking: server-side code responds to requests from devices that can’t be tracked in the traditional way, generates a tracking pixel carrying the salient pieces of information we want, and passes them through to GA that way.

It’s designed to sit on mobile-only websites, like m.whatever.com or mobile.whatever.com, BUT it does open up some pretty interesting opportunities…
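
To make the pixel idea concrete, here is a minimal Python sketch of how a server might assemble such a request. It is not Google's snippet or Adrian's code. The endpoint is a made-up placeholder, and the parameter set is an assumption: utmn and utmr are the names mentioned in the discussion of the code, while utmhn, utmp and utmac are assumed here for the host name, page path and account ID.

```python
import random
from urllib.parse import urlencode

# Hypothetical collector endpoint, used purely for illustration.
PIXEL_ENDPOINT = "https://collector.example/pixel.gif"


def build_pixel_url(account, host, path, referrer="", endpoint=PIXEL_ENDPOINT):
    """Return a tracking-pixel URL carrying the details of one page request."""
    params = {
        # Random number so that each pixel request is unique and not cached.
        "utmn": random.randint(0, 0x7FFFFFFF),
        "utmhn": host,            # assumed name: host that served the page
        "utmr": referrer or "-",  # referrer, "-" when none was sent (assumed convention)
        "utmp": path,             # assumed name: path of the requested page
        "utmac": account,         # assumed name: analytics account ID
    }
    return endpoint + "?" + urlencode(params)
```

The server would then request this URL itself, or emit it as an image tag, each time a device that cannot run the JS makes a page request.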

All you need to do is put a selection criterion in front of the GA code; in this case it is the request’s user agent.
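
As a rough illustration of that selection step, the Python sketch below checks a request's user agent against a short list of bot tokens. The token list and function names are hypothetical and chosen only for the example; a real deployment would load a maintained list, such as the spreadsheet of bots mentioned in this post.

```python
import re

# Illustrative tokens only; not a complete or authoritative list.
BOT_TOKENS = ["googlebot", "bingbot", "slurp", "baiduspider", "yandexbot"]

# One case-insensitive pattern that matches any of the tokens as a substring.
BOT_PATTERN = re.compile("|".join(re.escape(t) for t in BOT_TOKENS), re.IGNORECASE)


def match_bot(user_agent):
    """Return the matched bot token in lower case, or None if nothing matches."""
    if not user_agent:
        return None
    found = BOT_PATTERN.search(user_agent)
    return found.group(0).lower() if found else None


def should_fire_pixel(user_agent):
    """True when the request should be recorded via the server-side pixel."""
    return match_bot(user_agent) is not None
```

Substring matching is used because bot user agents tend to carry version numbers and URLs that change over time, so an exact comparison would miss small variations.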

Use Case 1 – Tracking Spiders & Bots

The original concept was to track search bots through analytics, something it did rather nicely. However, since the original post was published by Cardinal Path a little over two years ago, things have moved on a bit and the code had stopped working.

Yesterday I tracked down the original author, Adrian Vender, on Twitter and had a chat with him about it. (The power of social media, YAY!)

He has very kindly updated his original source code to work again in 2013, and you can now grab it directly from his personal blog here:

Tracking bots using analytics.

There are still a few things to iron out, but the code linked to from his post is QA ready.

Specifically, ignore the instructions to change your Analytics profile number from UA to MO (that’s deprecated, and comes from the older, non-functional version).

Also, the list of bots is a bit outdated, but I’ve built a more complete list that you can download here: bots.xlsx

Use Case 2 – Tracking Rogue iOS Mobile Devices

Another use (untested) for this approach might well be to recover the analytics for those rogue mobile devices that are currently not playing nicely with GA. After all, the only thing you need in order to record the traffic is an accurate list of the user agents those devices use.

It’s not something that I’ve had to deal with professionally yet, so I’ve not bothered trying to get it working, but if it’s something that has caused you an issue and you want to fork the code, please let me know and I’ll update this post and publish it here if you so desire.

Other Use Cases:

When you think about it, there are potentially thousands and thousands of devices these days that don’t necessarily accept or execute JavaScript, but that do have unique user agents. This could be anything from internet-enabled watches through to the in-dash internet access found in many new 2013/2014 cars.

You can also use it to see when you’re being spidered by people using Screaming Frog or Xenu Link Sleuth (for us old fogeys), and with a few custom alerts set up it could even warn you of a DDoS attack.
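
Custom alerts are configured in the GA interface rather than in code, but the underlying check is simple enough to sketch. The Python below is only an illustration of the threshold logic such an alert might apply to the recorded hit counts; the function name and both thresholds are hypothetical.

```python
from statistics import mean


def is_spike(history, current, factor=3.0, min_hits=100):
    """Flag `current` when it is well above the average of `history`.

    history  -- hit counts for earlier periods (e.g. previous days)
    current  -- hit count for the period being checked
    factor   -- hypothetical multiplier over the historical average
    min_hits -- hypothetical floor, so tiny sites do not trigger alerts
    """
    if current < min_hits or not history:
        return False
    return current > factor * mean(history)
```

For example, with previous daily counts of 120, 100 and 110 hits, a day with 900 hits would be flagged while a day with 200 would not.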

Basically, most things that you can do with server logs you can now do within the nice, familiar front end of Google Analytics!

Semi-obvious caveat: server error pages (i.e. 500s) are still unlikely to track, but that’s a small loss compared to the huge potential gain!

What use cases can you think of? Leave a comment below – all ideas appreciated!

UPDATE:

Following his comments below, Yousaf Sekander has emailed me screenshots of his WordPress plugin, SEO Crawlytics, which is designed to monitor bot activity.

You can download said plugin direct from WordPress.org here.

It’s a fantastic-looking plugin, and if you are a WP publisher it’s a nice, super easy way to get bot statistics. The GA method above will reveal far more bots by default, though, so it’s worth looking at as well. But for ease of use, especially if you aren’t comfortable with code, go ahead and install his plugin!

Martin MacDonald
Previously: Head of SEO, Omnicom. Inbound Marketing Director, Expedia. Head of Content & SEO, Orbitz. Currently: Marketing Consultant to Fortune 500's and High Growth Startups locally in Silicon Valley. Retired BlackHat & Current Tech SEO Geek.

18 Comments

  1. Yousaf 4 years ago

    Nice post! I don’t mean to plug my own stuff here, but I built a WordPress plugin a while ago that tracks and verifies bots.

    You can try it: http://www.rocketmill.co.uk/seo-crawlytics

    I found something really interesting: Googlebot sometimes makes some odd requests, i.e. to port 447, if it discovers that the structure of your site has been changed.

    Also, it’s amazing to see Google and Bing’s activity on a site….

    I can share some graphs from our own site if anyone is interested.

    • Author
      Martin MacDonald 4 years ago

      I’d be very interested, feel free to email them to me and I will update the post mate, thanks for the comment!

      • Yousaf 4 years ago

        I just emailed you a very big screenshot from my dashboard. The plugin gives you raw data, if you analyse it you will find out exactly how Google/Bing treats your site.

        Look at the following notification from the plugin and notice the 7080 port in the URL. We had changed our domain name, and Google was making requests to that port because Plesk/nginx was slightly misconfigured. Our crawlers couldn’t replicate the same issue…

        Hello,

        This is a quick notification to let you know that Googlebot-Mobile has visited http://www.rocketmill.co.uk:7080/tag/google-analytics.
        Time-stamp: 2012-08-14
        Referrer:
        Googlebot-Mobile Address: 66.249.66.12

        If you would like to stop receiving these notifications please change your preferences in SEO Crawlytics in your Settings panel.

        Kind regards,
        Crawl Messenger

        • Author
          Martin MacDonald 4 years ago

          Updated the main post with the screenshots received by email mate, thank you!

          • Yousaf 4 years ago

            Awesome, thank you.

  2. Barbio 4 years ago

    Great post, really interesting. I use WordPress so I think I will try the SEO Crawlytics plugin. Sharing and following this post!

  3. Michael 4 years ago

    We have been using an in-house bot logger plugin for WordPress. We actually just got a programmer to add a few new features recently… If we sell it, it will be at botlogger.com 🙂

  4. Giuseppe Pastore 4 years ago

    It looks like plenty of us have done similar stuff 🙂

    I have to say that a PHP script to track bots in Analytics already existed in 2009 (blog.mark8t.com/2009/04/29/track-search-engine-bots-and-spiders-with-google-analytics/)
    And in 2011 I translated it into a WordPress plugin (http://en.posizionamentozen.com/resource/wp-bots-analytics/)

    I guess Adrian’s script is better than the one I used two years ago; maybe I’ll ask him if we might update the plugin with his code.
    Not everyone can edit templates, but everyone can add a plugin with a couple of form fields…

    • Author
      Martin MacDonald 4 years ago

      Fantastic resource Giuseppe, thanks for posting the link, I’ll tweet Adrian to let him know!

  5. Giuseppe Pastore 4 years ago

    No problem, Martin. I’m not a great coder and surely Adrian could improve the plugin: I’d be glad if we could put things together and share a better version of it.
    If you both like, my email is info@posizionamentozen.com
    Feel free to get in touch!

    Giuseppe

    • Adrian 4 years ago

      Hi Giuseppe. I’d love to collaborate on combining my code with your WP plugin. I’ll send you a note soon 🙂

  6. Yousaf 4 years ago

    If someone wants to fork my plugin with the intent of improving it and sharing back with the community please feel free to do so.

  7. Dr David Sewell 4 years ago

    I recall using server-side tracking for apps on TV too. Thanks for the UA bot list – very handy!

    • Author
      Martin MacDonald 4 years ago

      You’re welcome David – until Monday when I started playing with this I’d never really needed a “thorough” list of UAs… They sure are tough to find!

      This was about as good as I could come up with, I’m afraid, but some regex is still needed for small variations.

  8. Richard 4 years ago

    Hi Martin, I’m guessing this tracking method might inflate the pageviews/visit count in analytics when it was a bot? Obvious ramifications for conversion rates if so – but correct me if I’m wrong.

    On tracking iOS – don’t those rogue devices refuse to allow you to identify the OS, so the best you’ll get is swapping a little direct traffic into ‘unknown-robot’? (Then again, I don’t know what that code’s doing – iOS users, identifiable or not, are still executing the JS tracking code – would they appear in this bot list?)

    I’m always interested to see which pages a bot requests (and which pages *don’t* get requested) during a log file analysis. What’s often more interesting is bot activity on resource URLs (like js/images/etc).

    Having thought about the idea, I’m not sure this data’s appropriate for my analytics – and it’s more useful for tracking inside a plugin like Yousaf’s – what do you think?

  9. Author
    Martin MacDonald 4 years ago

    Hi Richard,

    sure, let me correct you on that:

    The implementation instructions are clear that you must set up a different view (to track the traffic on) and exclude the source from your main GA view. If you take a look at the code, it adds utmn, utmr etc. and sets all traffic recorded by the pixel as source=bots. This resolves any inflated pageview problems.

    iOS: As pointed out, the iOS idea is untested and theoretical. Provided a device passes a UA, the above method works, and you would simply create a different view in GA and exclude it from the main profile.

    (Both of the above would be answered by reading the implementation instructions, but I guess I could have made it clearer in my post. Thank you for pointing that out.)

    Usefulness vs. Yousaf’s plugin:
    I can’t comment on your needs… But in my case, I don’t work with clients that use WordPress as a CMS.

    That fact alone means it’s not an option for me, but this is.

    Martin

    • Richard 4 years ago

      Ah, very cool – a separate profile makes a lot of sense. Thanks for that – I’ll read up.

      I suppose you have to be aware that the device exists already – that a UA string can be found in that PHP file – so identifying new devices as they emerge on to the market (thus identifying themselves with new UAs) would be more difficult?

      I’d like to see a working prototype for the iOS tracking, but I can’t currently see a way that could be done. Good luck though mate, I’m sure you’ll crack it.

      There are lots of log file analysis tools (including good old fashioned Excel!). That’s what I always end up using.

      Speak soon!

  10. Francois 4 years ago

    Thanks for this article and for showing Yousaf’s SEO Crawlytics plugin.
    It’s perfect to have a quick view of googlebot’s activity on my small blog without doing log analysis or setting up new GA profiles.

©2017 WebMarketingSchool.com is a product of MOG Media Inc
