Monday, July 28, 2014

The case of the disappearing users

Google continues its efforts to thwart me. In today's episode, we find that the number of unique users actually goes down when I increase the date range in Google Analytics.

238 users (formerly known as "unique visitors") in February + March:


254 users in March:


280 users in the last 12 days of March:


274 users in the last 11 days of March:



I believe the last two are correct. Once the start of the range goes earlier than March 20, extending it further back makes the user count keep decreasing. I wanted to see how far back this went (because eventually it would get to 0, right?), and I found that Sept. 12, 2013 - Mar. 31, 2014 shows 230 users, but if I keep going back (e.g., Jan. 1, 2009 - Mar. 31, 2014), it remains unchanged at 230 users. (Sept. 13, 2013 - Mar. 31, 2014 shows 231 users.) I can't figure out any significance to Sept. 13, 2013 or to 230 users.

What is important, however, is that March 20 was the first day the site went live and started collecting analytics.

So the lesson here is, when collecting Google Analytics data from a range that includes when your site went live, the beginning of the range has to be that same go-live date.

If you're doing monthly reports and getting data back for the entire months of June, May, and April, you have to be careful when you get to March -- instead of requesting the entire month of March, days 1 - 31, you have to request March 20 - 31. If you extend the start of a range to include any dates from before the site went live and started tracking, bogus data ensues. Yay Google!
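
For what it's worth, the guard I ended up wanting in my own import code is tiny -- here's a minimal sketch, where the go-live date is something we track ourselves (it's not anything the Analytics API hands you):

  // Clamp a requested start date so it never precedes the date the site went live
  // and started collecting analytics. "goLiveDate" comes from our own records.
  static DateTime ClampStartDate(DateTime requestedStart, DateTime goLiveDate)
  {
      return requestedStart < goLiveDate ? goLiveDate : requestedStart;
  }

  // Example: asking for Mar. 1 - 31 on a site that went live Mar. 20
  // should really query Mar. 20 - 31.
  DateTime start = ClampStartDate(new DateTime(2014, 3, 1), new DateTime(2014, 3, 20));  // 2014-03-20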

Monday, July 21, 2014

Auto-starting a Windows service at build time

My coworker Frank just posted a how-to on setting up a Windows service, which inspired this post.

If you have a Windows service as a C# project, you can set it up to start the service automatically anytime you build. You could even have it work only on your computer so others don't do it by accident, etc.

First, go into your solution configurations (that little dropdown at the top that usually says "Debug" or "Release") and, for your service project, add a new project configuration called "InstallServiceLocally." To make it even more explicit, you can leave the existing solution configurations alone, add a new solution configuration, and point it at the new "InstallServiceLocally" project configuration. Then, whenever you switch the dropdown from "Debug" to "InstallServiceLocally," building will do a normal debug build plus install the service.

After you add the project configuration, you will need to edit your csproj file (every C# project has one; a VB.NET project has a vbproj instead) and add the post-build event with the installation logic. Paste this in near the end of your csproj, just before the closing </Project> tag:

  <PropertyGroup Condition="'$(COMPUTERNAME)' == 'MyComputerNameGoesHere' and '$(Configuration)' == 'InstallServiceLocally'">
    <PreBuildEventDependsOn>SetLatestNetFrameworkPath;$(PreBuildEventDependsOn)</PreBuildEventDependsOn>
    <PostBuildEventDependsOn>SetLatestNetFrameworkPath;$(PostBuildEventDependsOn)</PostBuildEventDependsOn>
    <CleanDependsOn>UninstallDealerOnCmsGoogleAnalyticsImportService</CleanDependsOn>
    <LatestNetFrameworkPath>$(WinDir)\Microsoft.NET\Framework\v4.0.30319\</LatestNetFrameworkPath>
    <InstallUtilPath>$(LatestNetFrameworkPath)InstallUtil.exe</InstallUtilPath>
    <PostBuildEventWithDeployment>
      "$(InstallUtilPath)" "$(TargetPath)"
      net start "$(TargetName)"
    </PostBuildEventWithDeployment>
    <PreBuildEventWithDeployment>
      net stop "$(TargetName)"
      "$(InstallUtilPath)" /u "$(TargetPath)"
      Exit /b 0
    </PreBuildEventWithDeployment>
    <PostBuildEvent>$(PostBuildEventWithDeployment)</PostBuildEvent>
    <PreBuildEvent>$(PreBuildEventWithDeployment)</PreBuildEvent>
  </PropertyGroup>
  <Target Name="SetLatestNetFrameworkPath">
    <GetFrameworkPath>
      <Output TaskParameter="Path" PropertyName="LatestNetFrameworkPath" />
    </GetFrameworkPath>
  </Target>
  <Target Name="UninstallDealerOnCmsGoogleAnalyticsImportService">
    <Exec WorkingDirectory="$(OutDir)" Command="$(PreBuildEvent)" />
  </Target>

Whenever you build with a solution configuration that puts the project in its "InstallServiceLocally" configuration, and the computer name you're building on matches the one hardcoded in the project file (you can remove that check, or substitute whatever other condition you want, or no condition at all), then at the end of each build it will stop, reinstall, and restart the service.

After that, if you want to debug, you'll need to attach the debugger to the service exe... So if you really want to debug the service as it's running, you'll probably want to put a sleep or timer of some sort at startup so it doesn't start doing real work before your debugger is attached. Or you can always debug by creating a unit-test wrapper around it and debugging unit-test methods that call your service's internals. (Remember, you don't have to make things public for a unit test project to access them -- just use the [assembly: InternalsVisibleTo("Name.Of.Friend.Assembly")] attribute in the AssemblyInfo.cs file of the project containing the members you want to expose. That makes the test project a "friend assembly," much like a friend class in C++.)
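
If you go the sleep-at-startup route, one variation (just a sketch, not the exact code from our service) is to spin until a debugger attaches, guarded so it only happens in debug builds:

  public class MyImportService : System.ServiceProcess.ServiceBase   // hypothetical service class
  {
      protected override void OnStart(string[] args)
      {
  #if DEBUG
          // Give yourself time to attach the debugger to the freshly (re)installed
          // service before it starts doing any real work. Debug builds only.
          while (!System.Diagnostics.Debugger.IsAttached)
          {
              System.Threading.Thread.Sleep(1000);
          }
  #endif
          // ... normal service startup goes here ...
      }
  }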

Wednesday, July 16, 2014

Stop changing my data!!

Dear Google,

Every day I do an incremental import of your Analytics data by pulling in the data from the previous full day.
Why is it that when I come back the next day and run a new query for that same day's data, it's sometimes (all too often, really) different? Sometimes it's even different when querying the same completed date range twice in the same day.

Your documentation is confusing. I know you do some mysterious processing. I know there are some things related to data sampling that I need to worry about. I know if I give you $150,000 per account per year then I can upgrade to Premium Analytics and not have to worry (as much) about sampling.

Could you please make it clear, for other users who haven't already figured this out the hard way, that anything we query from your API may possibly be an approximation, and is liable to change if we run the same query an hour or a day later?

I have changed my daily import from 4 a.m. Eastern to 7 a.m. Eastern to ensure that any processing you may be doing on the previous day has had three additional hours to finish.

I have changed all my queries to use the highest-precision sampling level.

I am even going so far as to delete any data that was queried with a range including either of the previous two days, and to re-import it along with the latest day every time I run an import, because I don't have confidence that the data will stop changing until at least 48 hours after the day is done.

Just please be a little more up-front about this stuff next time, Google; it will save your users a lot of pain. It's really not cool to think you're ready to go live with brand-new reports built on imported Googly data, only to discover that one of those reports is internally inconsistent with itself: one part aggregates daily visits over the last 30 days (the result of 30 separate incremental imports), while another part shows the last 30 days aggregated together (the result of a single query, run afresh on each import, with the last 30 days as the date range).

Annoyed,
Samer

_________________________________

The pictures below show the kind of stuff you get when you run the same query for the same date range twice in Google Analytics. The numbers at the top come from one query, run yesterday, covering the past 30 days. The numbers at the bottom come from 30 queries of 1 day each over the past 30 days, added together. Before you conclude I'm doing something wrong: when I delete the 30 daily records and run all 30 of those queries again in one shot, the aggregated numbers suddenly start to match. I can literally run the same query over the same (finished) date range twice and get two different results. After a while it does stop changing -- probably 48 hours to be safe.
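
In case it helps anyone, the re-import rule I settled on boils down to something like this -- a sketch only; the delete and import helpers are hypothetical stand-ins for our own import code:

  // Treat any day less than ~48 hours old as still subject to change: throw away what
  // we previously imported for that window and pull it again along with the newest
  // complete day.
  DateTime today = DateTime.Today;
  DateTime reimportFrom = today.AddDays(-3);   // far enough back to cover the previous two days
  DateTime reimportTo = today.AddDays(-1);     // yesterday is the newest complete day

  DeleteImportedDailyRows(reimportFrom, reimportTo);   // hypothetical: clears our SQL copy for that window
  ImportDailyRows(reimportFrom, reimportTo);           // hypothetical: re-queries Google for the same window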


Google Analytics is just giving us trouble heaped on top of trouble!

Friday, June 6, 2014

Googly Channel Redux

Yesterday I mentioned a frustration I was having with Googly Analytics' default channel groupings apparently being calculated incorrectly in Google's own reports. I was following their exact criteria for calculating the groupings, and my results weren't matching what Google displayed. (I have to bucketize the channels myself, even though Google already does, because they don't expose channel through their API, grr.)

After looking at it some more, I think it's due to AdWords. It's possible to link an Analytics account with an AdWords account. I suspect (but can't confirm yet) that if that link were present, then the two fields "ad distribution network" and "ad format" would possibly be set. They're always "(not set)" right now, and those fields are part of the criteria we use to bucketize the channels.

It's possible that Google knows the ad distribution network and ad format from AdWords even though the accounts aren't linked, and that they use that knowledge to group the channels -- while still showing us "(not set)" for those AdWords values, which causes us to put our channels into different buckets.

Please quit being evil, googly.

Thursday, June 5, 2014

MSTest testsettings annoyance in automated builds

When you run a test using the MSTest command line, it outputs a note that it used the default test settings. Searching my PC, I couldn't find anything that looked like a default .testsettings file, so I gave up and assumed that if one isn't specified, the defaults are hardcoded or come from some location other than a .testsettings file... OK, fine, no big deal.

The problem is that when we changed everything over from LINQ-to-SQL to Entity Framework, 10 of our unit tests started failing in the automated build while still succeeding when run locally in Visual Studio. It turns out the failures also manifested themselves locally through the MSTest command line. But if you create a dummy .testsettings file (just the shell of the XML with nothing really set), then it works.

Somehow the behavior with no testsettings specified is different from the behavior with a dummy .testsettings file specified... Some DLL's weren't getting copied to the right place, and I know you can configure various deployment settings in the testsettings (configuring deployment and output folders has always been a nuisance in MSTest), but I didn't even have to do any of that -- it just worked with a dummy file. And all our tests now magically pass again on the build server.

Not to mention the nuisance that Microsoft can't seem to settle on a test-settings file format or a UI for editing it... There was testrunconfig (which had a GUI editor as part of Visual Studio), then they changed it to testsettings, which also had a GUI editor, so that was fine... Now they have (sort of? apparently? it's hard to tell...) deprecated those two in favor of runsettings, which has no GUI editor and doesn't do all the same things testsettings and testrunconfig did.

By the way, where are my trx files now when testing through Visual Studio 2012? The testing interface -- test list, test results, etc. -- has gone way backwards from Visual Studio 2010, and it's quite frustrating. But let's not get started on VS2012 and the ALL CAPS MENU BAR, the badly redesigned pending-changes window, the various workflow annoyances, etc... That's another bloog post that I'd rather not write.

Now, Microsoft, in your defense, you never claimed any such "Don't be evil" policy, so carry on doing what you do best. :)

P.S.: It's going to be great having Steve Ballmer as an NBA owner -- maybe they can let him give sweaty pre-game pep talks sometimes. "BALLPLAYERS BALLPLAYERS BALLPLAYERS, AAAH!"

ORM Frustration in .NET

I'm very frustrated with .NET ORM's.

We've been using Entity Framework. The syntax is great -- it has very full LINQ support. The update-model-from-database feature is nice, but it unfortunately comes at the price of a frustrating XML (EDMX) file that is very annoying to merge. Lately I've been making my schema changes in the XML by hand -- not cool. It does have a code-first way of doing things, but I haven't tried it... I'm sure it would be fine though.

We ran into some blocking issues with Entity Framework. For one, it doesn't support multiple databases in the same context -- even if they are on the same SQL Server instance. That one is just a nuisance. But there were also some severe performance issues. In one case, a single LINQ-to-Entities query generated two SQL queries under the hood: the first query took about 30 seconds and returned no data at all (or something meaningless like { { 1 } }), and the second query did all the work and took about a second. The only explanation I can come up with is an Entity Framework bug.

We tried converting everything to NHibernate, and the same query ran in a second or so. Other queries also ran faster -- not just the Entity Framework outliers, but normal ones as well. So that's pretty awesome. The fluent code-first approach is nice, especially when paired with a quick-and-dirty home-grown code generator that queries the SQL database and dumps the POCO's for you. (The NHibernate conversion, along with that bit of awesomeness, was the doing of my colleague Frank.)

Unfortunately, I ran into a lot of annoyances with NHibernate. Certain LINQ queries, in which I was doing joins and grouping by multiple columns and then doing aggregate functions in my select, caused NHibernate to barf. In some cases it could be worked around by splitting into two queries (minor performance hit, a bit more code), but because of another limitation (too many parameters -- which could probably be avoided if they would just inline the parameter values when building their SQL query, as opposed to using a parameter for each one), that isn't always possible. It can also be worked around by passing a hardcoded SQL query to NHibernate -- but that kind of defeats the whole purpose of using an ORM if we're going to throw away all that strong typing and have typos in our SQL queries detected only at runtime.

Another major nuisance is the inability to delete a range -- e.g., delete from tablename where column > criteria. The only way to delete (apart from hardcoding SQL) is to enumerate through each item in the IQueryable and delete it individually -- which is potentially a huge performance hit. It does, however, have multiple-database support, although I haven't played with it yet... Oh, and also, the syntax for executing stored procedures is just ridiculous -- transformer, AliasToBean -- I know it was ported from Java, but it just seems sloppy to have something called beans in a C# method name.
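
For the range-delete case, the workaround I'm describing looks roughly like this -- a sketch only, with a made-up entity, showing the enumerate-and-delete approach rather than the single DELETE ... WHERE statement I actually want:

  using System;
  using System.Linq;
  using NHibernate;
  using NHibernate.Linq;

  // Hypothetical mapped POCO.
  public class DailyMetric
  {
      public virtual int Id { get; set; }
      public virtual DateTime Date { get; set; }
  }

  public static class RangeDeleteExample
  {
      // Deletes every row older than the cutoff -- one object at a time,
      // because the LINQ provider has no bulk delete.
      public static void DeleteOlderThan(ISession session, DateTime cutoff)
      {
          using (var tx = session.BeginTransaction())
          {
              var stale = session.Query<DailyMetric>()
                                 .Where(m => m.Date < cutoff)
                                 .ToList();
              foreach (var row in stale)
              {
                  session.Delete(row);   // one DELETE per row -- the performance hit
              }
              tx.Commit();
          }
      }
  }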

I was getting sick of the NHibernate issues, so I tried Telerik's OpenAccess ORM. Telerik is a decent company that makes .NET controls, a really awesome .NET decompiler (now that Red Gate bought out Reflector and made it no longer free), etc. OpenAccess used to be commercial only, but now it is free (though not open-source, which is really unfortunate). It supports query delete and supposedly has LINQ support as well. It also has a fluent mapping API with a bit of a twist -- it lets you write POCO's without special annotation, and a post-processor runs after compilation (via MSBuild targets) and injects all the database logic into the POCO's.

It does work, but first I ran into a multiple-database issue. It supports multiple schemata, but if you also try to prefix with the database name, it fails. It shouldn't be so hard to implement, in my opinion, because it's really easy to execute a standard SQL query by just prefixing a table with the database and schema name. When I looked at their free (but closed-source) code in their own decompiler, I could see that they had a property for the full name of a table, but even though I was passing it in properly, they weren't setting the internal fields to support it, and it ended up giving weird errors. Maybe not the end of the world, but not cool -- especially because the older (paid) version did support multiple databases in exactly this way (by simply specifying database.schema when specifying the table name). A lot of the old docs are still hanging around online, so there's some confusion, because they're practically two different products.

One of the joined group-by queries, which worked in Entity Framework but made NHibernate explode, worked as-is in OpenAccess, so I was excited by that... But when I tried another, more complicated one (joining four tables together and then grouping and aggregating in LINQ), it failed miserably. I rewrote the query many different ways (using wheres, using explicit joins, using sub-selects instead of joins, etc.), and each way failed with a different cryptic error.

I suppose by default we'll stick with NHibernate for now. It's not too much work switching between ORM's as long as they have LINQ support -- a lot of copy/paste and regex transformations on the POCO's and queries, or writing some code to regenerate tables. I am glad that we're using a solution that is open-source... Maybe one day we'll get bold and fix the issues ourselves (or if we were completely insane, write our own ORM)... And then we can close-source it, sell it (or at least sell consulting services around it) and change our name from DealerOn to ORMon (Mormon without the M).

Wednesday, June 4, 2014

Hello world!

Welcome to my bloog.

I've been working with Google Analytics lately. I wrote a Windows service to import various data from over 1,000 Google Analytics profiles (and growing) into our own SQL database. We use the data to run reports against, so we can have our own reports on Googly data. They do have an API we could query directly, so we wouldn't have to bother keeping our own copy of the data, and that would be happy and nice, except there is a limit of 50,000 queries per day. The company I work for, DealerOn, builds websites for car dealers, and if we averaged it out, we would have to tell the car dealers they can only view reports that consume a maximum of 100 queries per day. I'm sure they would not appreciate being told that. (The reports are not viewable on the car dealer websites, but rather on our own content management system, which the car dealers' employees log in to in order to manage their websites.)

The import service has really grown to the point that I could devote quite a few bloog posts to things I've learned from it.

For now I just wanted to mention a couple "don't be evil" things I've noticed while doing this. And I know it's ironic because I'm using Google to host this blog in which I will now proceed to complain about them.

Google Analytics lets you do queries by passing in dimensions and metrics -- basically, x-axis and y-axis -- the dimensions group the data, and the metrics are the data points reported for each unique combination of dimensions. Dimensions include date, year, month, day, hour, minute, source, medium, campaign, browser version, etc., and metrics include such things as number of sessions (formerly called visits), users (formerly called unique visitors), transactions, page views, time on page, etc.
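
For anyone who hasn't touched the API, a query through the .NET client library looks roughly like this. This is a sketch assuming the Google.Apis.Analytics.v3 client; the credential setup, application name, and profile ID are placeholders, and property names may differ slightly between client versions:

  using Google.Apis.Analytics.v3;
  using Google.Apis.Analytics.v3.Data;
  using Google.Apis.Http;
  using Google.Apis.Services;

  public static class AnalyticsQueryExample
  {
      // "credential" is whatever OAuth2 credential you've already set up elsewhere.
      public static GaData GetTrafficByDateAndMedium(IConfigurableHttpClientInitializer credential)
      {
          var service = new AnalyticsService(new BaseClientService.Initializer
          {
              HttpClientInitializer = credential,
              ApplicationName = "AnalyticsImportService"   // hypothetical name
          });

          // Metrics are the numbers; dimensions are how they're grouped.
          DataResource.GaResource.GetRequest request = service.Data.Ga.Get(
              "ga:12345678",               // hypothetical profile ID (prefixed with "ga:")
              "2014-06-01", "2014-06-03",  // start and end dates
              "ga:sessions,ga:users");     // metrics
          request.Dimensions = "ga:date,ga:medium";
          request.MaxResults = 10000;      // the per-query row cap mentioned further down

          return request.Execute();        // .Rows: dimension values first, then metric values
      }
  }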

So some stuff is well-known and can't necessarily be classified into the "don't be evil" part of this. For example, average time on page is going to be underreported (and sometimes zero) because it's based on the time span between when a user loaded a page and when they loaded the next page on your site -- so the last page they visit in a session won't have any time associated with it. OK, fine -- this is a limitation in the way browsers work, and while there may be workarounds, it is well understood and all analytics engines suffer from this shortcoming.

Also, anything involving uniqueness gets annoying, because we must rely on Google to aggregate the data and count the uniques for us based on the dimensions we supply -- that is, we can't query individual data points from Google, such as visitor or session ID, and therefore for any time span we want unique users for, we have to specifically ask Google for that exact time span. This is a huge problem for us when importing, because we can't import the raw data (they don't expose it -- my kingdom for an in-house Google Analytics server!) and we obviously can't pre-query every possible permutation of time span; so we compromise by doing separate imports of all days, all weeks (Google calculates their weeks in a non-ISO-approved manner, haha), all months, plus the past 30 days and 31-60 days ago.

But wait, that's not all! Because we want to let our customers run reports of the current month to date against the previous month, without cheating by scaling either one up or down to a full or partial month, we also import the current incomplete month as well as the same number of days in the previous month (e.g., this morning we imported data points for May 1-3, 2014 and June 1-3, 2014, so we can have unique users, etc. in our database for those specific timeframes). OK, all of that is understood and well-known and everyone has to deal with it. It's probably possible to work around it by passing in custom variables (or, now in Universal Analytics, custom dimensions) to specify session and user, and to manage those ourselves using cookies -- and it would be nice to do that eventually. But the amount of data would multiply greatly if we wanted all that raw data, and it might not be easy to stay within our 50,000-query limit, so OK, we won't count that against their don't-be-evil...
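
To make that month-to-date comparison concrete, the matching previous-month window is just date arithmetic -- a tiny sketch of the idea, using the June 4 example above:

  // The newest complete day when the import ran on June 4, 2014:
  DateTime lastCompleteDay = new DateTime(2014, 6, 3);

  DateTime monthToDateStart = new DateTime(lastCompleteDay.Year, lastCompleteDay.Month, 1);  // June 1
  int daysSoFar = lastCompleteDay.Day;                                                       // 3

  DateTime prevMonthStart = monthToDateStart.AddMonths(-1);                                  // May 1
  DateTime prevMonthEnd = prevMonthStart.AddDays(daysSoFar - 1);                             // May 3

  // Ask Google for June 1-3 and May 1-3 as two separate queries, because unique users
  // for a window can only come from Google aggregating that exact window for us.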

Don't worry, we're getting to the more sinister stuff. (Fun fact: if you whistle Bobby McFerrin's "Don't worry, be happy" to my 23-month-old girl, she will say "Don't worry. Be happy." and will then futilely attempt to whistle it. Actually, no -- rather, if I whistle it. If you whistle it to her she might look at you funny or possibly kick you or throw a ba-ba at your head and laugh and/or scream -- just a warning.)

Query throttling... OK, so Google doesn't want us to execute more than 10 queries per second. I understand -- they don't want us DoS-ing their API, fine. Fortunately, they make it easier for us by allowing us to specify a quotaUser string: if we specify an arbitrary string of our own choosing, we get 10 queries per quotaUser per second instead of just 10 queries per second globally. So we make our code clever and put in timers to count the queries per second per quotaUser, and we make sure we never exceed 10 queries per second per quotaUser. But just to be really careful, we also implement exponential backoff as suggested by Google: upon getting a rate-limit-exceeded error (remember, this should never happen, because we're already self-limiting based on timers and quotaUser), we handle it, wait 1 second plus some random milliseconds, then retry; if that fails, we wait 2 seconds plus random ms and retry; and so on, until on the last try we wait 32 seconds plus random ms.
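
That backoff boils down to something like this -- a sketch, not the real import code; IsRateLimitError is a hypothetical stand-in for however you detect that particular error:

  using System;
  using System.Threading;

  public static class BackoffExample
  {
      static readonly Random Jitter = new Random();

      // Runs the query, retrying on rate-limit errors with delays of
      // 1, 2, 4, 8, 16, and 32 seconds, each plus random jitter.
      public static T ExecuteWithBackoff<T>(Func<T> runQuery)
      {
          for (int attempt = 0; ; attempt++)
          {
              try
              {
                  return runQuery();
              }
              catch (Exception ex)
              {
                  if (attempt >= 6 || !IsRateLimitError(ex))
                      throw;
                  int delayMs = (1 << attempt) * 1000 + Jitter.Next(0, 1000);
                  Thread.Sleep(delayMs);
              }
          }
      }

      // Hypothetical: inspect the exception for Google's rate-limit error reason.
      static bool IsRateLimitError(Exception ex)
      {
          return ex.Message.Contains("rateLimitExceeded") || ex.Message.Contains("userRateLimitExceeded");
      }
  }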

And then it works like a charm for a couple of weeks, until all of a sudden it starts failing consistently with rate-limit-exceeded errors. Even with the self-limiting. Even with the exponential backoff. And because I'm such a cleverclogs, I wrapped the whole import process in a retry, and I just keep doing it until all profiles have imported successfully... And because I'm so clever that I email myself on every error (errors are supposed to be rare, and I want to fix them right away), this results in several hundred thousand emails in my inbox with all these stupid rate-limit-exceeded errors, followed by one last email notifying me that the daily quota limit has been exceeded. So apparently, if you keep retrying the same query because their own API doesn't adhere to their own contract, each retry counts against your daily limit. Good to know! So now not only am I dealing with this, but I have to wait until the next day to fix it. (Not true: because I'm still a cleverclogs, I managed to write the fix, commit it, and deploy it to production, and instead of testing it on dev -- I had exceeded my daily quota -- I waited until after midnight and was lucky enough that the changes worked.)

The funny thing is, when I was getting the "rate limit exceeded" errors, I went to the Google API's web interface and looked up my daily usage. I saw spikes corresponding to when I was doing my import, but it never ever spiked above 6 queries per second globally.

So I commented out the quotaUser stuff and went back to limiting myself to 10 queries per second globally, which is fine because Google wasn't my bottleneck in the first place. And I changed my exponential backoff to be even slower than what Google recommends. And I noticed I was never setting the OauthToken of my request, so I set that to my credential's Token.AccessToken. (It was always working before, but I figured it might be a good idea to set that.)

So now it works -- no more rate-limit exceeded errors, and no more daily quota exceeded errors (not yet, at least... but if we keep adding dealers we'll need to take action to prevent it again). Was it the access token thing? If so, why was it ever working because that seems like an authorization thing? Does quotaUser just not work at all? But what does it matter because I never even exceeded 10 / second globally? Did slowing down the exponential backoff take care of it? Could be, but I never did put in any logging for how many times I retry. But now it works consistently and I'm just happy it works.

Oh, but we did hit some daily limits even without the rate-limit-exceeded nightmare. All it would take was a couple of extra imports in a day -- because I found a bug or was developing a new feature and had to run a few extra imports against our development database -- plus the fact that every day I ran one import on production and one on dev so our developers would have accurate reports. (Of course the dev DB should already be synced to the production DB, so this shouldn't be an issue in the first place, but let's not go there...) I've seen other reports on the forums about quotaUser not working at all, so I put that under "don't be evil" for sure. Liars, liars, pantalones en fuegos.

So then I implemented a data sync across database servers, from production to dev, so I wouldn't have to do the same import twice every day. That's definitely worth a blog post... It was pretty complicated code that had no value to the customer, except that I had to write it because of Google's daily quota limit. At least it was fun to design and implement. So that's getting close to "don't be evil" territory, although I understand Google doesn't want us to overload their servers.

Here's another "don't be evil": If you request data for the current day and ask for year, it will give you 2014. If you ask for month, it will give you 6, and if you ask for day, it will give you 4. (Today is June 4, 2014).
But if you ask for date, it will give you a string like "Date(2014, 3, 1)". It took me a while to figure out that they zero-index the month, and only in the date dimension -- the year, month, and day dimensions are one-indexed, as are the year and day components of the date dimension. Come on, Googly, really? Don't be evil? For a while there I never even considered that they would do something so ridiculous; when I got back a date like 2/30, I was just swallowing the errors and continuing with the import, so all my data was off by a month, with some data also missing.
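
The fix is a one-off parser along these lines -- a sketch based on the format described above; adjust to whatever string your client actually returns:

  // Parses strings like "Date(2014, 3, 1)", where only the middle value (the month)
  // is zero-indexed -- so "Date(2014, 3, 1)" actually means April 1, 2014.
  static DateTime ParseGoogleDate(string value)
  {
      string[] parts = value
          .Replace("Date(", string.Empty)
          .Replace(")", string.Empty)
          .Split(',');

      int year = int.Parse(parts[0].Trim());
      int month = int.Parse(parts[1].Trim()) + 1;   // compensate for the zero-indexed month
      int day = int.Parse(parts[2].Trim());

      return new DateTime(year, month, day);
  }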

Also, there's a 10,000-records-per-result-set limit. That's not necessarily evil, but it is annoying. So I built a wrapper around all my queries that checks for this and fetches however many additional pages it needs as additional queries. (Woohoo, first time I've had to write a recursive algorithm in I don't know how long!)
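
The wrapper amounts to something like this -- a sketch against the same hypothetical GetRequest shown earlier, minus the retry and error handling the real thing needs:

  // Follows the 10,000-row pages recursively until every row has been collected.
  static List<IList<string>> GetAllRows(DataResource.GaResource.GetRequest request, int startIndex = 1)
  {
      request.StartIndex = startIndex;
      request.MaxResults = 10000;

      GaData page = request.Execute();
      var rows = new List<IList<string>>(page.Rows ?? new List<IList<string>>());

      int totalResults = page.TotalResults ?? 0;
      if (rows.Count > 0 && startIndex - 1 + rows.Count < totalResults)
      {
          rows.AddRange(GetAllRows(request, startIndex + rows.Count));   // recurse for the next page
      }
      return rows;
  }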

Let's see... Way back when I started this and was getting set up with their API, it was confusing figuring out which versions of their API to use. I tried to get them from NuGet -- the only place to get them, because even though they were advertised as open-source, the source was nowhere to be found online (DON'T BE EVIL!) -- but I couldn't use them, because their own packages on NuGet referenced the wrong versions of their own dependent DLL's, so I would get runtime errors and it would just not work at all. So I tried older versions from NuGet, and nothing would work, until finally I found a project (https://github.com/rmostafa/DotNetAnalyticsAPI) that included the Google Analytics API's from a version that actually worked.

"Let us C"... what else...

The filter regular expressions don't support case sensitivity, which proved a minor nuisance to me, but one I could work around. Slightly evil, perhaps, but not the end of the world as we know it.

Ooh, this is the one I was dealing with most recently, and the reason I decided to start this blog to document it.

Default channel definitions... Great, so the business wants to display channel groupings in our reports instead of the medium I had been displaying. That's fine, right? Channels have been available on the Google Analytics website since last fall, so I can just change my import service to import channel in addition to medium, right? No, the price is WRONG, Bob! They don't expose channels through their API! They only expose them through their reports, so the business naturally thinks "anything we see in the Google Analytics website's reports, we can import and sell to our customers" -- and of course we already committed to having channel data in our own reports... It wasn't unreasonable for us to assume that Google exposed channels, and it's absurd that they don't...

OK, do a bit of digging and Google does at least tell you how they calculate their channels (https://support.google.com/analytics/answer/3297892?hl=en) -- mainly based on medium, but also on source, ad distribution network, social source referral, and ad format. (Of course, they could change their default channel grouping definitions without warning -- I believe they've done so before, because I found a different set of definitions online, also from Google -- and then we'd have to change our definitions to match and re-import all the historical data, but whatever... this is the best we can do.) Our traffic-source import already pulls date, week, source, medium, campaign, social network, and keyword. So we can just add ad distribution network, social source referral, and ad format, calculate channel ourselves on import, stick it in the database, and it will match Google, right? Brilliant!
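
The calculation itself ends up looking something like this -- a simplified sketch paraphrased from memory of that support page, not the authoritative rule set; the real definitions (exact regexes and their precedence) should come from Google's documentation:

  using System.Text.RegularExpressions;

  public static class ChannelGrouping
  {
      // Approximation of Google's default channel grouping rules, evaluated in order.
      public static string Calculate(string source, string medium, string adDistributionNetwork,
                                     bool socialSourceReferral, string adFormat)
      {
          if (source == "(direct)" && (medium == "(none)" || medium == "(not set)"))
              return "Direct";
          if (medium == "organic")
              return "Organic Search";
          if (socialSourceReferral || Regex.IsMatch(medium, "^(social|social-network|social-media|sm|social network|social media)$"))
              return "Social";
          if (medium == "email")
              return "Email";
          if (Regex.IsMatch(medium, "^(cpc|ppc|paidsearch)$") && adDistributionNetwork != "Content")
              return "Paid Search";
          if (Regex.IsMatch(medium, "^(cpv|cpa|cpp)$"))
              return "Other Advertising";
          if (Regex.IsMatch(medium, "^(display|cpm|banner)$") || (adDistributionNetwork == "Content" && adFormat != "Text"))
              return "Display";
          if (medium == "referral")
              return "Referral";
          return "Other";   // doesn't match any default channel group
      }
  }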

So as I'm doing this -- while the business is probably wondering why it's taking me all day just to change medium to channel in my report -- I'm of course limited by another bit of evil: a maximum of 7 dimensions per query! The additions would have bumped me up to 10 dimensions... But it just barely works out: I got rid of the week number and added code to calculate it myself, replaced the separate source and medium with the single sourceMedium dimension, added adDistributionNetwork and adFormat, and instead of adding social source referral, I just check whether socialNetwork is anything other than "(not set)". That brings me to exactly 7 dimensions, and I can now calculate channels...

...Which I do, and to great fanfare, we now have Google channel data in our database even though Google doesn't expose it through their API! Good beats evil! Victory!

But wait, there's more!

Channel Group Name                               Our database*    Google's own reports
Direct                                           459              459
Email                                            2                2
Display                                          0                3
Other Advertising                                0                0
Paid Search                                      348              345
Referral                                         156              156
Social                                           0                0
Other (not matching any default channel group)   9                9

* Imported from Google, with channel calculated the same way Google says they calculate it.

Well, what is this here? Google has taken 3 records that should have been in Paid Search and reclassified them as Display! I know they're supposed to be in Paid Search because I went into Google's own reporting GUI and looked up all the dimensions I could, and I can see that medium is "cpc" while adDistributionNetwork and adFormat are both not set. I triple-checked my code, and I'm grouping exactly the way Google says they do. I also found a bunch of records in Paid Search that Google had grouped correctly, which likewise had medium == "cpc" and adDistributionNetwork and adFormat both not set. I tried to discern a pattern -- maybe they changed the way they define their buckets and I could make the same change -- but there is no pattern. It's got to be a Google bug.

Lo and behold, I found someone else who has encountered the same issue, with a screenshot to prove it:
http://www.seerinteractive.com/blog/new-google-analytics-channel-groupings#definitionofchannels

Don't be evil, Google -- please quit it.

So now what recourse do I have? We have to live with the fact that our data does not and cannot match Google's data. Never mind that our data is correct and Google's is wrong, according to their own definition. If the customer compares, they will want it to match, because that's what we are advertising -- bringing Google reports into our platform. Fantastica. (To which my 5- and 4-year-old sons would both reply "don't say that!" because they can't stand Dora the Explorer.)

Fantastica, Google, fantastica. Now quit being evil, or I'll have to make my import service load the reports for every profile I want to import and scrape the channel data by reading the UI elements through Selenium or some such foolishness. (No, I would never dream of that, but I can't think of any other way to get that (apparently incorrect) data into our database. So please just forget I mentioned Selenium: it's completely impossible and I never said that at all. Although it probably is impossible to do GUI automation without a logged-on user, due to session 0 isolation.)

At least I got some blog fodder. Thanks, Google. And thanks for hosting this blog too.

Now quit the evilness wrt Googly Analytics.

(No, wait, no... Thanks for keeping me employed as the Googly import guy.)

Thanks, Google! Be more evil!

(But on the other hand, telling us you're going to shut off Google Voice through XMPP over our OBI VOIP phone adapters, and then we all panic and figure out alternate solutions, and then the kill date comes and it still works? Really, Google? Cut it out...)

Until next time! Hopefully next time I won't update on several months of work in a single blog post, so it will be shorter... Anyone who's read this far deserves a happy-face: :)