By Joe Beernink on June 15, 2010

The systems development world today is all about instant gratification.  Web search results must return in 1.2 seconds or less.  Web pages must load in less than 3 seconds.  A search must not only find every result for ‘Flugenschnitzel’, it must also show the most popular references to ‘Flugenschnitzel’ in the last 30, 60 or 90 days.  All of these things matter to users.  People want the most up-to-date information possible, and they want it now.

But what do you do when wanting up-to-date information means that the page takes too long to load?  You make an intelligent choice.

For instance, if you are calculating the most popular product on a site, it usually doesn’t matter if the ranking is five minutes old.  If you are calculating the remaining balance in a bank account, it does matter.  In that case the following approach doesn’t apply, and the user may need to wait a fraction of a second longer because they need to see absolutely accurate information all the time.

We recently had the first scenario.  There were three lists of products on a page: Most Recently Updated Products, Most Popular Products, and Featured Products.  The page as originally written and tested here on our development servers took no time at all to load.  But when we moved the site to Windows Azure and put it under load, performance dropped dramatically.  It didn’t take long to understand why.  There was greater latency between the client and the server, and between the server and the database.  Each round trip we were making to the database or to the server came at a cost.  We couldn’t control the round trip latency, but we could control the number of trips.  We started to tune the result sets to reduce the amount of data needed for each list.  We improved our client side caching so that fewer graphics were being brought down.

But one query kept slowing down the page refresh, and it got worse as more data was added to the site.  It was the Most Popular Product query.  Every time the page refreshed, the popularity of every record was being recalculated.  That is obviously not a good idea in hindsight, but when we laid out the site, it was a minor query, and a minor part of the site.  We didn’t tune it until we needed to, which is arguably a good thing.

We considered our options.  We could have the popularity update every time someone used or viewed the product, but that would cause those actions to be slower.  We could cache the results for a few minutes on the server side, but whoever requested the page after the cache expired would still have to pay the price for the slow query, and that would also be bad.
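To make the drawback of the caching option concrete, here is a minimal sketch in Python (not the .NET code our site uses).  The `TtlCache` class and its parameters are hypothetical names made up for this illustration.  It shows that a timed cache still makes one unlucky request run the slow query each time the cached value expires.

```python
import time


class TtlCache:
    """Caches one computed value for ttl_seconds (hypothetical helper)."""

    def __init__(self, compute, ttl_seconds, clock=time.monotonic):
        self._compute = compute        # the slow query, passed in as a function
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so the cache can be tested
        self._value = None
        self._expires_at = None        # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache miss: this caller waits for the slow query to finish.
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value
```

Every call to `get()` inside the time window is fast, but the first call after the window closes is exactly as slow as the uncached page was.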

We decided to create a new worker role in Azure that wakes up every few minutes, calculates the popularity (and a few other key metrics), and then goes back to sleep.  The worker role runs out of sight of the user, and populates a flattened metrics table that is easily and quickly queried.  The views used by the MVC application never had to change, but we did change the data access layer to use these new fields.  It took about four hours to build and deploy the new Accumulator, and as soon as the deployment was complete and the stats updated, the site went from 7 seconds for the home page refresh to less than 1 second.  The users never know, and rarely care, that the list is a few minutes old.  They want the instant gratification of a quick page load.
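The shape of the Accumulator can be sketched in a few lines.  This is an illustration in Python with an in-memory SQLite database, not the actual Azure worker role.  The table names (`product_views`, `product_metrics`), the function names, and the choice of "view count" as the popularity measure are all assumptions made for the example.

```python
import sqlite3
import time


def create_schema(conn):
    # Hypothetical tables: raw view events plus the flattened metrics table.
    conn.executescript("""
        CREATE TABLE product_views (
            product_id INTEGER NOT NULL,
            viewed_at  TEXT    NOT NULL
        );
        CREATE TABLE product_metrics (
            product_id INTEGER PRIMARY KEY,
            view_count INTEGER NOT NULL
        );
    """)


def refresh_metrics(conn):
    """Recompute popularity for every product and store it in the flat table."""
    with conn:  # commit both statements together, or roll both back on error
        conn.execute("DELETE FROM product_metrics")
        conn.execute("""
            INSERT INTO product_metrics (product_id, view_count)
            SELECT product_id, COUNT(*) FROM product_views GROUP BY product_id
        """)


def most_popular(conn, limit=5):
    """The cheap query the page runs: no aggregation, just a sorted read."""
    rows = conn.execute(
        "SELECT product_id FROM product_metrics "
        "ORDER BY view_count DESC, product_id LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows.fetchall()]


def run_accumulator(conn, interval_seconds=300, cycles=None, sleep=time.sleep):
    """Wake up, recalculate, go back to sleep. cycles=None runs forever."""
    done = 0
    while cycles is None or done < cycles:
        refresh_metrics(conn)
        done += 1
        sleep(interval_seconds)
```

The expensive `GROUP BY` runs once per interval in the background loop, while the page only ever calls something like `most_popular`, which reads precomputed rows.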

Doing batch programming like this isn’t anything new.  And although the solution was quick and easy using an Azure worker role, it could be a little better if it were a scheduled job instead of one that was always running and sleeping.  The Azure platform charges by CPU usage, so it would be really nice if the job wasn’t always there running up the charges.  But in this circumstance, we had a site that was unusable due to performance, and the ability to deploy this type of solution quickly really kept the project on time and let us focus on enhancing the user experience instead of worrying about infrastructure.