Quickplace
I use a product called iMacros, which lets me time web page actions. There are other tools too, such as topas.
At the moment I time three different actions against each server in the cluster, and this is run once an hour, 24/7.
Then I make one change at a time to ensure I am not making the user experience worse. In my time I have seen people focus on a disk queue length and think things are getting better, when all they have done is remove a bottleneck that was in fact improving the overall user response times. So in short, always use your user experience stats as the real measure of improvement.
I take three different measurements:
1) How long to log on to the Quickplace server
2) How long to load a page
3) How long to update a page
All three timings are then put together to give me a single number for that server. I can check each hour of the day for big numbers and focus on finding the root cause. With this method I have got the response time on a server down from 106 seconds to 23 seconds. (Quite an improvement, so if you want to know how, keep reading.)
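
As a rough sketch of the scoring idea above, here is how the three timings could be combined and the big numbers picked out. This is only an illustration, not the script I actually run: it assumes "put together" means a simple sum, and the server name, the sample timings and the 60 second threshold are all made up.

```python
# Sketch only: assumes the three timings are simply added together.
# Server names, timings and the threshold are hypothetical examples.

def combined_scores(samples):
    """Return {(server, hour): total seconds} from per-action timings.

    Each sample is (server, hour, logon_s, load_s, update_s).
    """
    scores = {}
    for server, hour, logon_s, load_s, update_s in samples:
        scores[(server, hour)] = logon_s + load_s + update_s
    return scores


def slow_hours(scores, threshold_s):
    """Return the (server, hour) pairs whose total exceeds the threshold."""
    return sorted(key for key, total in scores.items() if total > threshold_s)


samples = [
    ("qp1", 9, 10.0, 50.0, 46.0),   # a bad hour
    ("qp1", 10, 3.0, 10.0, 10.0),   # a good hour
]
scores = combined_scores(samples)
print(slow_hours(scores, threshold_s=60.0))
```

The hours that come back from `slow_hours` are the ones worth digging into for a root cause.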

In this diagram you can see that I started by getting a benchmark. When I looked into the problem it seemed I had disk queue length issues. As I made changes you can see that the improvements worked and the times came down. Later I made tweaks to get small further improvements. Then all of a sudden things went wrong, and it turned out to be an LDAP issue. So you can see that measuring the response time of your service and servers from a user perspective gives you a good handle on issues. It also shows the benefit to your service from your improvements.
I also gather stats on individual servers as well as the overall service. Always make a change to a single server and ensure that the change does not make the user response times worse. Measure, change, and measure again. Simple, but it works.
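
The measure, change, measure loop can be sketched like this. Again this is only an assumption about how you might compare the numbers: it takes the combined response times for one server before and after a single change and compares the averages. The function name and the one second tolerance are hypothetical.

```python
from statistics import mean

# Sketch only: compares average response times for ONE server before and
# after ONE change. The tolerance is a made-up value to ignore small noise.

def change_verdict(before_s, after_s, tolerance_s=1.0):
    """Return 'better', 'worse' or 'no clear change' for a single change."""
    delta = mean(after_s) - mean(before_s)
    if delta > tolerance_s:
        return "worse"
    if delta < -tolerance_s:
        return "better"
    return "no clear change"


print(change_verdict([106, 100, 104], [23, 25, 24]))
```

If the verdict is "worse", back the change out on that one server before touching the rest of the cluster.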