Tuesday, December 2, 2008

Chapter 10: Performance Bottleneck Analysis

Software Performance Testing Handbook

A Comprehensive Guide for Beginners


Performance Test Engineers often feel that performance bottleneck analysis is a complex task. One cannot become an expert in bottleneck analysis just by reading articles or books; it takes experiencing bottlenecks first-hand and experimenting with them. I agree that it is more of an art than a science. But bottleneck analysis and isolation become easier when approached systematically and when the fundamental concepts of Queueing Theory are understood. This chapter explains a few things that every Performance Test Engineer needs to keep in mind during performance bottleneck analysis.

Scott’s Recommendation on Scatter Charts

Scatter Charts are one of the most powerful tools for bottleneck analysis. They provide a quick view of the bottleneck and make it easy for a Performance Test Engineer to explain it to non-technical stakeholders.

As part of the bottleneck analysis, the first chart that every Performance Test Engineer needs to look at is the response time chart. Look for the trend in the server processing time of each transaction or timer defined in the script over the span of the test. Always ignore the metrics collected during ramp up and ramp down, as no conclusion should be drawn while the server load is changing. Only metrics collected during the stable, constant load period should be considered for the analysis.

A Scatter Chart can be created by plotting the test time in seconds on the X-axis and the response time in seconds on the Y-axis. Each point represents the server response time of a specific transaction or timer defined in the script at that moment of the test. In the simple scatter chart accompanying this section, the blue dots represent the response time for loading the system home page and the red dots represent the response time of the login transaction.
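
The rule described above (discard ramp-up and ramp-down samples, then plot test time against response time for each transaction) can be sketched in Python. This is only a minimal illustration and not the export format of any particular test tool: the sample tuples, the function name steady_state_series and the ramp boundaries are all assumptions made for the example.

```python
from collections import defaultdict


def steady_state_series(samples, ramp_up_end, ramp_down_start):
    """Group (elapsed_s, transaction, response_s) samples by transaction,
    keeping only those recorded during the constant-load window."""
    series = defaultdict(list)
    for elapsed_s, transaction, response_s in samples:
        if ramp_up_end <= elapsed_s < ramp_down_start:
            series[transaction].append((elapsed_s, response_s))
    return dict(series)


# Hypothetical measurements: (test time in s, transaction, response time in s)
samples = [
    (30, "home_page", 0.8),    # ramp up, ignored
    (150, "home_page", 1.1),
    (160, "login", 2.4),
    (400, "login", 2.6),
    (580, "home_page", 0.9),   # ramp down, ignored
]

series = steady_state_series(samples, ramp_up_end=120, ramp_down_start=540)
for name, points in series.items():
    print(name, points)
```

Each resulting series is a list of (X, Y) points for one transaction, so it can be handed to any charting tool as one colour of dots on the scatter chart.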

Performance Test Analysis

Although performance testing is meant to simulate high loads on the system and identify its break points, in practice not all projects aim to find bottlenecks, because of time or cost constraints. Often project stakeholders expect the Performance Engineers to do only application benchmarking (run the planned tests and report the system behavior), even though a production move is planned after the performance test. In my experience, I have often ended up having to convince project stakeholders of the business risk of benchmarking an application without planning for bottleneck analysis and isolation. I have to accept that I have sometimes failed in that attempt and ended up doing only benchmarking, even though the project was planned for a production move.

It is essential that every Performance Test Engineer understand what the application requires and recommend it to the stakeholders, regardless of time or cost constraints. However, acceptance is at the stakeholders’ discretion.

Whenever performance tests are run (benchmark tests, load tests, stress tests, etc.), the first metric to look at is the response time. This is the basic metric that shows the server’s processing time, that is, the time taken by the server to respond to user requests. Always adopt the practice of merging the response time and running users graphs, both for analysis and for reporting. Then look at the scatter chart pattern and start investigating the suspects.
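
Merging the response time and running users graphs amounts to aligning the two series on a common time axis. The Python sketch below shows one way to do that outside the test tool. The input shapes, the function name merge_by_interval and the 60 second interval are assumptions for illustration only.

```python
def merge_by_interval(response_samples, user_samples, interval_s=60):
    """Return rows of (interval_start_s, running_users, avg_response_s).

    response_samples: iterable of (elapsed_s, response_s)
    user_samples:     iterable of (elapsed_s, running_users)
    """
    buckets = {}
    for elapsed_s, response_s in response_samples:
        start = (elapsed_s // interval_s) * interval_s
        buckets.setdefault(start, []).append(response_s)

    users = {}
    for elapsed_s, running in user_samples:
        start = (elapsed_s // interval_s) * interval_s
        users[start] = running  # the last reading in the interval wins

    rows = []
    for start in sorted(buckets):
        avg = sum(buckets[start]) / len(buckets[start])
        # users.get() yields None when no user reading exists for the interval
        rows.append((start, users.get(start), round(avg, 3)))
    return rows


# Hypothetical data: response times and running-user readings
response_samples = [(10, 1.0), (50, 1.2), (70, 2.0), (110, 2.4)]
user_samples = [(0, 10), (60, 20)]

rows = merge_by_interval(response_samples, user_samples)
for row in rows:
    print(row)
```

Reading the merged rows side by side makes it easier to see whether response time rises with the number of running users or degrades while the load stays constant.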

The second important metric to look at is the server throughput: the server’s processing capability in terms of requests handled per unit time, transactions handled per unit time, or bytes returned to the users per unit time. Test tools provide several graphs for this metric, such as Hits per second, Transactions per second, and Throughput (in bytes/second).

You may wonder why I am discussing the hits per second metric here, since it refers to the input load applied to the system and not to the server’s output. To understand it better, consider this example. A small tank has an inlet pipe at the top and an outlet pipe of the same capacity at the bottom. Assume the tank has no other leakage. Fill half of the tank with water while the outlet pipe is closed, then open the outlet pipe and observe the water inflow and outflow. The rate of water flowing in through the inlet pipe will be the same as the rate of water flowing out through the outlet pipe. This holds as long as the tank is in a good, stable condition. The same applies to a server in a stable state: the incoming flow rate, A (arrivals per unit time), and the outgoing flow rate, C (completions per unit time), are equal as long as the server is not saturated.

A = C where A is the arrival rate & C is the completion rate
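
The A = C relation can be checked with simple arithmetic on the counts observed over a measurement window. The sketch below is an illustration under assumed numbers: the request counts, the 300 second window and the 5% tolerance are hypothetical, and flow_balance is not a function of any test tool.

```python
def flow_balance(arrivals, completions, window_s, tolerance=0.05):
    """Return (A, C, balanced) for one observation window.

    A is the arrival rate and C the completion rate, both per second.
    The window counts as balanced when C is within `tolerance` of A.
    """
    a_rate = arrivals / window_s
    c_rate = completions / window_s
    balanced = abs(a_rate - c_rate) <= tolerance * a_rate
    return a_rate, c_rate, balanced


# Stable server: completions keep up with arrivals
stable = flow_balance(arrivals=6000, completions=5970, window_s=300)
print(stable)

# Saturated server: requests arrive faster than they complete
saturated = flow_balance(arrivals=6000, completions=4500, window_s=300)
print(saturated)
```

In the second case the gap between A and C means requests are queueing up inside the system, which is the tank filling faster than it drains.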

The Hits per second graph shows the number of requests handled by the server per unit time. For increasing loads, the hits per second should increase linearly, and during the constant load period the hits on the server should stay constant. A drop in the hits per second value indicates an issue on the server: it is not able to handle the incoming requests, which points to instability. Such a drop needs to be cross-verified against the server monitoring graphs to look for saturation in the various service centers.
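
A drop in hits per second during the constant load period can also be flagged programmatically before turning to the server monitoring graphs. The following sketch compares each reading against the median of the steady-state window. The 20% threshold, the function name find_hit_drops and the sample readings are assumptions chosen for illustration, and a real test would need a threshold suited to its own variability.

```python
from statistics import median


def find_hit_drops(hits_per_second, drop_fraction=0.2):
    """Return the indexes of readings that fall more than `drop_fraction`
    below the median of the steady-state hits/second series."""
    baseline = median(hits_per_second)
    threshold = baseline * (1 - drop_fraction)
    return [i for i, hits in enumerate(hits_per_second) if hits < threshold]


# Hypothetical steady-state readings, one per sampling interval
readings = [100, 102, 98, 101, 60, 55, 99, 100]

drops = find_hit_drops(readings)
print(drops)
```

The flagged intervals give the time range in which to inspect CPU, memory, disk and network graphs for a saturated service center.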

The errors observed during the tests need to be analyzed, and test data related issues need to be isolated from application errors. If test data related issues exist, they need to be fixed and the test rerun to check whether the issues occur consistently.

Performance Testing Vs Profiling

Performance test tools can pinpoint the user actions that have high response times. With the right set of monitors configured on the server infrastructure, hardware related bottlenecks can be identified effectively during performance tests. For software bottlenecks, however, performance test tools in general do not show which tier or software component consumes the most processing time. They also cannot drill down to the method call or component that contributes to the high response time, so issues in the application code cannot be identified from the performance test tool.

Profiling is a form of dynamic program analysis that helps in understanding program behavior by inserting specific code into the program (instrumentation) to capture call times, call stacks, frequency of method calls, call performance, thread concurrency issues, memory usage limits, garbage collection details and heap behavior during program execution. For a Java application, the profiling tool hooks into the JVM and instruments the application classes to monitor their performance. It can therefore show the call chains and help identify the line of code that leads to high processing time under a 1 user load. Profilers are considered unit test tools because they help in identifying code performance, and it is a best practice to profile the application and address the software issues before subjecting the system to performance tests.
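
The chapter discusses JVM profilers, but the idea of single-user profiling can be illustrated in any language. The sketch below uses Python’s built-in cProfile and pstats modules as an analogous example and not as a substitute for a Java profiler. The functions handle_request and slow_lookup are hypothetical stand-ins for application code.

```python
import cProfile
import io
import pstats


def slow_lookup(items, target):
    # Deliberately inefficient: a full linear scan on every call.
    return sum(1 for item in items if item == target)


def handle_request():
    # Stand-in for one user action that calls slow_lookup repeatedly.
    items = list(range(2000))
    return sum(slow_lookup(items, t) for t in range(50))


# Profile a single "user" executing the request once.
profiler = cProfile.Profile()
profiler.enable()
result = handle_request()
profiler.disable()

# Write the ten entries with the highest cumulative time to a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

The report lists call counts and cumulative time per function, which is the kind of drill-down to method level that a load test tool by itself does not provide.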


HP Diagnostics

The HP Diagnostics tool provides drill-down capabilities to identify the root cause of high response time transactions in the system. It helps identify slow performing components, memory leaks, thread contention, slow SQL queries, slow layers and concurrency issues during load tests. Its ability to show server performance behavior under load is what makes HP Diagnostics different from other profiler tools like DevPartner, GlowCode, etc. The Diagnostics tool can be integrated with the HP performance test tools (LoadRunner, Performance Center), so the same scripts used for performance testing can be used to create the required load on the server while the server performance characteristics are monitored and diagnosed.




6 comments:

raj said...

Hi,

The article is very good. I need to know what are the graphs that can be merged with Response time, hits per second and throughput.

-Rajneesh

ಚಂದ್ರ - (Chandru) said...

Hi..

Please Provide the missing Example which is Indicated regarding Scattered Chart.

Thank you
Chandru

Anonymous said...

Hey Ramya ,

Nice article.

There are also patterns we can see with scatterd charts , which help in analysis. May be you can write about that as well.

Regards
Viswanath

Anonymous said...

Hey,
Nice to have all these things in a single window.
Thanks a lot.

Thanks you,
G.Viswanath

QA Kranthi said...

Hi Ramya...
I am going thru with ur blog frm last year(of course this is my second comment :).. really very nice blog for Perf testers...
the book u released is very good...
Can you please expain us about timers in Laodrunner? why and how to insert and use of Timers.

Thanks in advance

Rajesh said...

Hi Ramya

The Article is really good and you have the Information about the performance testing is clear and Precise Manner thanks so much .This is rajesh Parimkayala I am working as a software Test analyst in Department of education australia, i want to move over into performance testing
I will certainly visit blore when i come to india and seek more knowledge from you
Thanks Once again