Capacity and Performance monitoring – Pay attention

When we build a house or a bridge or a skyscraper, we pay attention to questions like: will the structure be able to support the load that goes on it? Will it hold up over time? Will it meet the needs of the homeowner or the road traffic or the businesses housed in it, in terms of space, time to get in and out, and so on?

Maybe I’m going overboard with the examples, but it seems to me that capacity planning and performance monitoring would play a key part in any project being undertaken. Yet I think in IT, these functions get taken for granted or are ignored until later in the game. Very frequently, the people defining the requirements (me) or designing the implementation do not know how to estimate or predict the potential impact on the systems (or only pay lip service to it). Very often we react to issues after they come up in Production… and the response can be to turn off the feature after having invested time and money in it.

Recently I had a conversation with a member of our Capacity and Performance Monitoring (CPM) team. That group gets no love, IMHO.

We have major feature enhancements that go into our Products. A lot of the time, these new enhancements have the potential to stress our system limits. The CPM team is responsible for predicting these potential issues. Load and performance testing during the Certification period is done within a controlled environment. The test results reveal system issues, which then get fixed. Finally the release happens. We hope everything goes well, but the fact is that all the testing happened within our network and under simulated test conditions.

The CPM team then starts their magic – monitoring the effects of the release in Production. These are now real-world numbers. The system is monitored and reports are produced daily, weekly or monthly. What happens to those reports? I have no idea.

Every once in a while, I get curious when preparing requirements for the next enhancement, and we get with the CPM team to find out how some page has been performing over a certain time period. The report then gets forgotten again. The CPM team identifies spikes in usage and is able to point out times when the system was stressed, or maybe even events that led to an outage. The usual firefighting happens, things get solved and we move on.

It is my opinion that, just like the usage reports we rely on to justify certain feature enhancements, the Production capacity and performance reports contain valuable pieces of information that need to be extracted and fed back into the ideation sessions. Why did the spike happen? Why did the system go down when it did? Is there a trend in Production? Are we utilizing our hardware capacity to the max, or are we under-utilizing it? I think there are numbers that need to be looked at and decisions to be made.
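To make those questions a little more concrete, here is a minimal sketch of how a few of them could be asked of the raw numbers. Everything in it is an assumption for illustration: the function names, the sample readings, and the idea that a report boils down to a list of load measurements taken at regular intervals. It is not how our CPM team actually works.

```python
from statistics import mean, pstdev


def find_spikes(samples, threshold=2.0):
    """Return indexes of samples more than `threshold` standard
    deviations above the average of the whole series."""
    avg = mean(samples)
    spread = pstdev(samples)
    if spread == 0:
        return []  # a perfectly flat series has no spikes
    return [i for i, value in enumerate(samples)
            if (value - avg) / spread > threshold]


def trend_slope(samples):
    """Least-squares slope: average change in load per interval.
    Positive means load is growing over the period."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to compute a trend")
    x_avg = (n - 1) / 2
    y_avg = mean(samples)
    numerator = sum((x - x_avg) * (y - y_avg) for x, y in enumerate(samples))
    denominator = sum((x - x_avg) ** 2 for x in range(n))
    return numerator / denominator


def utilization(samples, capacity):
    """Average and peak load as a fraction of the stated capacity."""
    return {"average": mean(samples) / capacity,
            "peak": max(samples) / capacity}


# Hypothetical hourly readings (requests per second), made up for this post.
hourly_load = [100, 110, 105, 120, 115, 400, 125, 130]

spike_hours = find_spikes(hourly_load)                 # which hours stand out?
growth_per_hour = trend_slope(hourly_load)             # is load creeping up?
headroom = utilization(hourly_load, capacity=500)      # 500 is an assumed limit
```

With the made-up readings above, only the sixth hour (index 5) gets flagged as a spike. The numbers alone will not explain why it happened, but they tell us where to start asking.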

The hard part is looking at all the numbers and graphs and making sense of them. So we tend to put them aside for a later time and never get back to them. We need to invest some time here to see where it takes us.

