Wednesday, June 4, 2014

Hello world!

Welcome to my bloog.

I've been working with Google Analytics lately. I wrote a Windows service to import various data from over 1,000 Google Analytics profiles (and growing) into our own SQL database. We use that data to run our own reports against Googly data. They do have an API we could query directly, so we wouldn't have to bother keeping our own copy of the data, and that would be happy and nice, except there is a limit of 50,000 queries per day. The company I work for, DealerOn, builds websites for car dealers, and if we averaged it out, we would have to tell the car dealers they can only view reports that consume a maximum of 100 queries per day. I'm sure they would not appreciate it if we told them that. (The reports are not viewable on the car dealer websites, but rather on our own content management system, which the car dealers' employees log in to in order to manage their websites.)

The import service has really grown to the point that I could devote quite a few bloog posts to things I've learned from it.

For now I just wanted to mention a couple "don't be evil" things I've noticed while doing this. And I know it's ironic because I'm using Google to host this blog in which I will now proceed to complain about them.

Google Analytics lets you do queries by passing in dimensions and metrics -- basically, x axis and y axis -- the dimensions group the data, and the metrics are the data points to display for each unique group of dimensions. Dimensions include date, year, month, day, hour, minute, source, medium, campaign, browser version, etc., and metrics include such things as number of sessions (formerly called visits), users (formerly called visitors), transactions, page views, time on page, etc.
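
To make that concrete, a single query against the v3 Core Reporting API from .NET looks roughly like the sketch below. This assumes the Google.Apis.Analytics.v3 client library, and I'm writing the member names from memory, so treat it as a sketch rather than copy-paste-able code; "service" is assumed to be an already-authorized AnalyticsService (OAuth setup omitted).

```csharp
using System;
using Google.Apis.Analytics.v3;
using Google.Apis.Analytics.v3.Data;

public static class AnalyticsQuerySketch
{
    // Sketch: ask for sessions and users, grouped by date and medium, for one profile.
    // "service" must already be authorized; auth plumbing is left out of this sketch.
    public static void PrintSessionsByMedium(AnalyticsService service, string profileId)
    {
        var request = service.Data.Ga.Get(
            "ga:" + profileId,       // which profile (view) to query
            "2014-06-01",            // start date
            "2014-06-04",            // end date
            "ga:sessions,ga:users"); // metrics: the data points

        request.Dimensions = "ga:date,ga:medium"; // dimensions: how to group the data

        GaData data = request.Execute();
        if (data.Rows == null) return; // no data for this date range

        // Rows come back as strings, dimensions first, then metrics, in the order requested.
        foreach (var row in data.Rows)
            Console.WriteLine("{0} {1}: {2} sessions, {3} users", row[0], row[1], row[2], row[3]);
    }
}
```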

So some stuff is well-known and can't necessarily be classified into the "don't be evil" part of this. For example, average time on page is going to be underreported (and sometimes zero) because it's based on the time span between when a user loaded a page and when they loaded the next page on your site -- so the last page they visit in a session won't have any time associated with it. Ok, fine -- this is a limitation in the way browsers work, and while there may be workarounds, this is well understood and all analytics engines suffer from this shortcoming.

Also, anything involving uniqueness gets annoying, because we must rely on Google to aggregate the data and count the uniqueness for us based on the dimensions we supply -- that is, we can't query individual data points from Google such as visitor or session ID, so for any time span we want unique users for, we have to ask Google for that specific time span. This is a huge problem when importing, because Google doesn't expose the raw data (my kingdom for an in-house Google Analytics server!) and we obviously can't pre-query every possible permutation of time span; so we end up compromising in our import by doing a separate import of all days, all weeks (and also, Google calculates their weeks in a non-ISO-approved manner, haha), all months, and the past 30 days and 31-60 days ago. But wait, that's not all! Because we want to let our customers run reports of the current month to date against the previous month, without cheating by scaling either one up or down to a full or partial month, we also import the current incomplete month as well as the same number of days in the previous month (e.g., this morning we imported data points for May 1-3, 2014 and June 1-3, 2014, so we can have unique users, etc. in our database for those specific timeframes).

Ok, all that is understood and well-known and everyone has to deal with it. It would probably be possible to work around it by passing in custom variables (or now in Universal Analytics, custom dimensions) to specify session and user, and to manage those ourselves using cookies -- and it would be nice to do that eventually. But the amount of data would multiply greatly if we wanted all that raw data, and it might not be easy to do within our 50,000-query limit, so ok, we won't count that against their don't be evil...
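
To make the month-to-date comparison above concrete, the two matching date ranges are just simple date math (my own sketch, not anything Google provides):

```csharp
using System;

public static class MonthToDateRanges
{
    // Sketch: given "today", return the current month-to-date range (through yesterday,
    // since we only import complete days) and the same number of days in the previous
    // month, e.g. on June 4, 2014: June 1-3 and May 1-3.
    public static void Compute(DateTime today,
        out DateTime currentStart, out DateTime currentEnd,
        out DateTime previousStart, out DateTime previousEnd)
    {
        currentStart = new DateTime(today.Year, today.Month, 1);
        currentEnd = today.AddDays(-1); // note: on the 1st of a month there's nothing to import yet

        previousStart = currentStart.AddMonths(-1);
        previousEnd = previousStart.AddDays((currentEnd - currentStart).Days);
    }
}
```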

Don't worry, we're getting to the more sinister stuff. (Fun fact: if you whistle Bobby McFerrin's "Don't worry, be happy" to my 23-month-old girl, she will say "Don't worry. Be happy." and will then futilely attempt to whistle it. Actually, no -- rather, if I whistle it. If you whistle it to her she might look at you funny or possibly kick you or throw a ba-ba at your head and laugh and/or scream -- just a warning.)

Query throttling... Ok, so Google doesn't want us to execute more than 10 queries per second. I understand -- they don't want us DoSing their API, fine. But fortunately, they make it easier for us by allowing us to specify a quotaUser string, and if we specify that as an arbitrary string of our own choosing, we can get 10 queries per quotaUser per second instead of just 10 queries per second globally. So we make our code clever and put in timers to count the queries per second based on quotaUser, and we make sure we never exceed 10 queries per second per quotaUser. But just to be really careful, we also implement exponential backoff as suggested by Google, wherein upon getting a rate limit exceeded error (remember, this should never happen because we're already self-limiting based on timers and quotaUser), we handle it, wait 1 second plus some random milliseconds, then retry; if that fails, we wait 2 seconds plus random milliseconds and retry, and so on, doubling each time until the last try, where we wait 32 seconds plus random milliseconds.
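
For reference, the retry loop is nothing fancy; here's roughly the shape of it. This is a simplified sketch, not our production code -- in particular, detecting a rate-limit error by searching the exception message is a crude stand-in for inspecting the actual error reason the client library gives you.

```csharp
using System;
using System.Threading;

public static class Backoff
{
    // Sketch of exponential backoff with jitter: wait 1s, 2s, 4s, ... 32s
    // (each plus a random number of milliseconds) between retries, then give up.
    public static T ExecuteWithBackoff<T>(Func<T> query, int maxRetries = 6)
    {
        var random = new Random();
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return query();
            }
            catch (Exception ex)
            {
                // Crude stand-in: real code should inspect the API error reason
                // (e.g. userRateLimitExceeded) rather than the message text.
                bool rateLimited = ex.Message.IndexOf("RateLimitExceeded",
                    StringComparison.OrdinalIgnoreCase) >= 0;

                if (!rateLimited || attempt >= maxRetries)
                    throw; // not a rate-limit error, or we already waited 1+2+...+32 seconds

                int delayMs = (1 << attempt) * 1000 + random.Next(0, 1000);
                Thread.Sleep(delayMs);
            }
        }
    }
}
```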

And then it works like a charm for a couple weeks, until all of a sudden it starts failing consistently with rate-limit exceeded errors. Even with the self-limiting. Even with the exponential backoff. And because I'm such a cleverclogs, I wrapped the whole import process in a retry and I just keep doing it until all profiles have imported successfully... And because I'm so clever that I email myself on every error (because errors are supposed to be rare and I want to fix them right away), this results in several hundred thousand emails in my inbox with all these stupid rate-limit exceeded errors, followed by one last email notifying me that the daily quota limit was exceeded. So apparently if you keep retrying the same query because their own API doesn't adhere to their own contract, each retry counts against your daily limit. Good to know! So now not only am I dealing with this, but I have to wait until the next day to fix it. (Not true -- because I'm still a cleverclogs, I managed to write the code to fix it, commit and deploy it to production, and instead of testing it on dev, because I had exceeded my daily quota, I waited until after midnight and was lucky enough that the changes worked.)

The funny thing is, when I was getting the "rate limit exceeded" errors, I went to the Google API's web interface and looked up my daily usage. I saw spikes corresponding to when I was doing my import, but it never ever spiked above 6 queries per second globally.

So I commented out the quotaUser stuff and went back to limiting myself to 10 queries per second globally, which is fine because Google wasn't my bottleneck in the first place. And I changed my exponential backoff to be even slower than what Google recommends. And I noticed I was never setting the OauthToken of my request, so I set that to my credential's Token.AccessToken. (It was always working before, but I figured it might be a good idea to set that.)

So now it works -- no more rate-limit exceeded errors, and no more daily quota exceeded errors (not yet, at least... but if we keep adding dealers we'll need to take action to prevent it again). Was it the access token thing? If so, why was it ever working before, since that seems like an authorization thing? Does quotaUser just not work at all? But then what does it matter, because I never even exceeded 10 per second globally? Did slowing down the exponential backoff take care of it? Could be, but I never did put in any logging for how many times I retry. But now it works consistently and I'm just happy it works.

Oh, but we did hit some daily limits even without encountering the rate-limit exceeded nightmare. All it would take was a couple extra imports in a day, because I found a bug or was developing a new feature and had to do a few extra imports on our development database, plus the fact that every day I ran one import on production and one import on dev so our developers would have accurate reports. (Of course the dev DB should already be synced to the production DB so it shouldn't be an issue in the first place, but let's not go there...) I've seen other reports on the forums about quotaUser not working at all, so I put that under "don't be evil" for sure. Liars, liars, pantalones en fuegos.

So then I implemented a data sync across database servers from production to dev so I wouldn't have to do the same import twice every day. That's definitely worth a blog post... It was pretty complicated code that had no value to the customer, except I had to write it because of Google's daily quota limit. At least it was fun to design and implement. So that's getting close to "don't be evil" territory, although I understand Google doesn't want us to overload their servers.

Here's another "don't be evil": If you request data for the current day and ask for year, it will give you 2014. If you ask for month, it will give you 6, and if you ask for day, it will give you 4. (Today is June 4, 2014.)
But if you ask for date, it will give you a string "Date(2014, 5, 4)". It took me a while to figure out that they were zero-indexing only the month, and only in the date dimension, while one-indexing the year, month, and day dimensions, and also the year and day components of the date dimension. Come on, Googly, really? Don't be evil? For a while there I never even considered they would do something so ridiculous; when I got back a date like 2/30, I was just swallowing the error and continuing with the import, so all my data was off by a month, with some data also missing.
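
Once you know the trick, the fix is a one-liner of "add one to the month". Something like this (my own parsing code, obviously not anything Google ships):

```csharp
using System;
using System.Text.RegularExpressions;

public static class GoogleDateParser
{
    // Parse Google's "Date(year, month, day)" strings, remembering that only the
    // month is zero-indexed (JavaScript-Date style): "Date(2014, 5, 4)" is June 4, 2014.
    public static DateTime Parse(string value)
    {
        Match m = Regex.Match(value, @"^Date\((\d+),\s*(\d+),\s*(\d+)\)$");
        if (!m.Success)
            throw new FormatException("Unrecognized Google date: " + value);

        int year  = int.Parse(m.Groups[1].Value);
        int month = int.Parse(m.Groups[2].Value) + 1; // the zero-indexed month
        int day   = int.Parse(m.Groups[3].Value);
        return new DateTime(year, month, day);
    }
}
```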

Also, there's a limit of 10,000 records per result set. That's not necessarily evil, but it is annoying. So I built a wrapper around all my queries that checks for this and fetches any additional pages it needs as follow-up queries. (Woohoo, first time I had to write a recursive algorithm in I don't know how long!)
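
The wrapper amounts to re-issuing the same query with a moving start index until Google says there's nothing left. A loop does the job just as well as recursion; here's the general idea (again a sketch -- I'm writing the StartIndex/MaxResults/TotalResults member names from memory, so check them against whatever client library version you end up with):

```csharp
using System.Collections.Generic;
using Google.Apis.Analytics.v3;
using Google.Apis.Analytics.v3.Data;

public static class PagedQuery
{
    // Sketch: page through a Core Reporting query 10,000 rows at a time.
    public static List<IList<string>> GetAllRows(AnalyticsService service, string profileId,
        string startDate, string endDate, string metrics, string dimensions)
    {
        var allRows = new List<IList<string>>();
        const int pageSize = 10000; // the documented per-request maximum
        int startIndex = 1;         // this particular index is 1-based

        while (true)
        {
            var request = service.Data.Ga.Get(profileId, startDate, endDate, metrics);
            request.Dimensions = dimensions;
            request.StartIndex = startIndex;
            request.MaxResults = pageSize;

            GaData page = request.Execute();
            if (page.Rows != null)
                allRows.AddRange(page.Rows);

            // TotalResults is the row count across all pages; stop once we've passed it.
            if (page.TotalResults == null || startIndex + pageSize > page.TotalResults)
                break;

            startIndex += pageSize;
        }

        return allRows;
    }
}
```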

Let's see... Way back when I started this and was getting set up with their API, it was confusing figuring out which versions of their API to use. I tried to get them from NuGet (which was the only place to get them, because even though they were advertised as open source, the source was nowhere to be found online (DON'T BE EVIL!)), but I couldn't use them because their own API packages on NuGet referenced the wrong versions of their own dependent DLLs, so I would get runtime errors and it would just not work at all. So I tried older versions from NuGet, and nothing would work, until finally I found a project (https://github.com/rmostafa/DotNetAnalyticsAPI) that included the Google Analytics APIs from a version that actually worked.

"Let us C"... what else...

The filter regular expressions don't support case sensitivity, which proved a minor nuisance to me, but which I could work around. Slightly evil, perhaps, but not the end of the world as we know it.

Ooh, this is the one I was dealing with most recently, and the reason I decided to start this blog to document it.

Default channel definitions... Great, so the business wants to display channel groupings in our reports, instead of medium, which is what I had been importing. That's fine, right? It's been available on the Google Analytics website since last fall, so I can just change my import service to import channel in addition to medium, right? No, the price is WRONG, Bob! They don't expose channels through their API! They only expose them through their reports, which leads the business to think "anything we see on the Google Analytics website's reports, we can import and sell to our customers" -- so of course we had already committed to having channel data in our own reports... Of course it wasn't unreasonable for us to assume that Google exposed channels, and it's absurd that they don't...

Ok, do a bit of digging and Google does at least tell you how they calculate their channels (https://support.google.com/analytics/answer/3297892?hl=en) -- mainly based on medium, but also based on source, ad distribution network, social source referral, and ad format. (Of course they could change their default channel grouping calculations without warning, and I believe they've done so before, because I found a different set of definitions online, also from Google, and then we'd have to change our definitions to match and re-import all the historical data -- but whatever... this is the best we can do.) Ok, so our import for traffic source data already pulls date, week, source, medium, campaign, social network, and keyword. So we can just add ad distribution network, social source referral, and ad format, and then calculate channel ourselves on import, and stick it in the database, and it will match Google, right? Brilliant!
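
For what it's worth, the default definitions as documented in 2014 boil down to something like the sketch below. The rule order and regexes here are paraphrased from that support article (and from memory), so treat them as an approximation -- as noted, Google can change them without warning.

```csharp
using System.Text.RegularExpressions;

public static class DefaultChannelGrouping
{
    // Rough re-implementation of Google's *default* channel grouping rules as documented
    // circa 2014 (support.google.com/analytics/answer/3297892). Approximate, not gospel.
    public static string Classify(string source, string medium, string socialSourceReferral,
        string adDistributionNetwork, string adFormat)
    {
        source = source ?? "";
        medium = medium ?? "";

        if (source == "(direct)" && (medium == "(none)" || medium == "(not set)"))
            return "Direct";

        if (medium == "organic")
            return "Organic Search";

        if (socialSourceReferral == "Yes" ||
            Regex.IsMatch(medium, "^(social|social-network|social-media|sm|social network|social media)$"))
            return "Social";

        if (medium == "email")
            return "Email";

        if (medium == "affiliate")
            return "Affiliates";

        if (medium == "referral")
            return "Referral";

        if (Regex.IsMatch(medium, "^(cpc|ppc|paidsearch)$") && adDistributionNetwork != "Content")
            return "Paid Search";

        if (Regex.IsMatch(medium, "^(cpv|cpa|cpp|content-text)$"))
            return "Other Advertising";

        // Some versions of the definition also require Ad Format not to be "Text" here.
        if (Regex.IsMatch(medium, "^(display|cpm|banner)$") ||
            (adDistributionNetwork == "Content" && adFormat != "Text"))
            return "Display";

        return "(Other)"; // nothing matched a default channel group
    }
}
```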

So as I'm doing this, and while the business is probably wondering why it's taking me all day just to change medium to channel in my report, I'm of course limited by another bit of evil -- a maximum of 7 dimensions per query! This would bump me up to 10 dimensions... But it works out exactly, because I got rid of the week number and added code to calculate it myself, then got rid of source and medium and added sourceMedium as one dimension, added adDistributionNetwork and adFormat, and instead of adding social source referral, I just check whether socialNetwork is anything other than "(not set)" -- and that brings me to exactly 7 dimensions, and I can now calculate channels...
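
And for completeness, the week number I now calculate myself looks roughly like this. It's my attempt at reproducing Google's numbering -- weeks running Sunday through Saturday, with week 01 starting on January 1 no matter what weekday that is -- which, as mentioned earlier, is not ISO 8601; if Google's definition turns out to be subtly different, adjust accordingly.

```csharp
using System;

public static class GoogleWeek
{
    // My attempt at reproducing ga:week: weeks run Sunday through Saturday, and week 01
    // starts on January 1 whatever weekday that is (so week 01 may be shorter than 7 days).
    // Deliberately not ISO 8601, to match what Google appears to do.
    public static int WeekOfYear(DateTime date)
    {
        var jan1 = new DateTime(date.Year, 1, 1);

        // Find the Sunday that starts week 02.
        int daysUntilSunday = ((int)DayOfWeek.Sunday - (int)jan1.DayOfWeek + 7) % 7;
        if (daysUntilSunday == 0)
            daysUntilSunday = 7; // January 1 is itself a Sunday, so week 01 is a full week

        DateTime week2Start = jan1.AddDays(daysUntilSunday);
        return date < week2Start ? 1 : 2 + (date - week2Start).Days / 7;
    }
}
```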

...Which I do, and to great fanfare, we now have Google channel data in our database even though Google doesn't expose it through their API! Good beats evil! Victory!

But wait, there's more!

Channel Group Name                               Our database*    Google's own reports
Direct                                           459              459
Email                                            2                2
Display                                          0                3
Other Advertising                                0                0
Paid Search                                      348              345
Referral                                         156              156
Social                                           0                0
Other (not matching any default channel group)   9                9

* imported from Google, with channel calculated the same way Google says they do it

Well, what is this here? Google has taken 3 records that should have been in Paid Search and reclassified them as Display! I know they're supposed to be in Paid Search because I went into Google's own reporting GUI and looked up all the dimensions I could, and I can see medium is "cpc" and adDistributionNetwork is not set and adFormat is not set. I triple-checked my code, and I'm grouping exactly the same way Google says they do. I also found a bunch of records in Paid Search that Google had grouped correctly, which likewise had medium == "cpc" with adDistributionNetwork and adFormat both not set. I was trying to see if I could discern a pattern -- to see if perhaps they changed the way they defined their buckets, so that I could make the same change -- but there is no pattern. It's got to be a Google bug.

Lo and behold, I found someone else who has encountered the same issue, with a screenshot to prove it:
http://www.seerinteractive.com/blog/new-google-analytics-channel-groupings#definitionofchannels

Don't be evil, Google -- please quit it.

So now what recourse do I have? We have to live with the fact that our data does not and cannot match Google's data. Never mind that our data is correct and Google's is wrong, according to their own definition. If the customer compares, they will want it to match, because that's what we are advertising -- bringing Google reports into our platform. Fantastica. (To which my 5- and 4-year-old sons would both reply "don't say that!" because they can't stand Dora the Explorer.)

Fantastica, Google, fantastica. Now quit being evil or I'll have to make my import service load reports for every profile I want to import and scrape the channel data by reading the UI elements through Selenium or some such foolishness. (No, I would never dream of that, but I can't think of any other way to get that (apparently incorrect) data into our database. So please just forget I mentioned Selenium: it's completely impossible and I never said that at all. Although it probably is impossible to do GUI automation without a logged-on user, due to Session 0 isolation.)

At least I got some blog fodder. Thanks, Google. And thanks for hosting this blog too.

Now quit the evilness wrt Googly Analytics.

(No, wait, no... Thanks for keeping me employed as the Googly import guy.)

Thanks, Google! Be more evil!

(But on the other hand, telling us you're going to shut off Google Voice through XMPP on our OBi VoIP phone adapters, and then we all panic and figure out alternate solutions, and then the kill date comes and it still works? Really, Google? Cut it out...)

Until next time! Hopefully next time I won't update on several months of work in a single blog post, so it will be shorter... Anyone who's read this far deserves a happy-face: :)
