Tuesday, December 2, 2008

Chapter 4: Workload Identification

Software Performance Testing Handbook

A Comprehensive Guide for Beginners

What is Workload?

Workload refers to the load placed on the server by real-time user access or by the load generated during performance tests. In G. Kotis' words, "a workload can be defined as the set of all inputs (programs, commands, etc.) that the system receives from its environment". For example, for a UNIX terminal, the stream of system commands the end user enters at the terminal is the workload. The workload of one system differs from that of another.

The workload can be either natural or synthetic. The natural (or real) workload is the actual load observed in the production environment. A synthetic workload mimics the real-time behavior of the system and thereby models the natural workload; the characteristics of a synthetic workload can be controlled.

The workload selected for the tests should closely resemble the real-time workload pattern of the web application. The larger the deviation between the test workload and the production workload, the less accurately the performance test results will reflect the system's performance in the production environment.


Web log Analysis

Web log analysis refers to analyzing the log files of a web server (for example, IIS or Apache) and deriving metrics about the access patterns of real-time users. Various formats are used to record information in a log file; some of the ASCII formats are the W3C Extended log file format, the IIS log file format and the NCSA common log file format. Different formats use different time zones to record activities (the W3C Extended format uses Coordinated Universal Time (UTC), which is the same as Greenwich Mean Time (GMT)).

Sample entries in the W3C Extended log file format and the IIS log file format are provided below, followed by a short parsing sketch.


#Fields: time c-ip cs-method cs-uri-stem sc-status cs-version

Example:

17:42:15 172.16.255.255 GET /default.htm 200 HTTP/1.0

#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent)


Example:

2005-12-11 00:00:22 131.127.217.67 - 131.127.217.69 80 GET /scripts/mycalender.dll Test 200 282 270 0 Mozilla/4.06+ [en] + (WinNT; +I)
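As an illustration, the following minimal Python sketch parses entries in the W3C Extended format shown above. It reads the field order from the #Fields directive rather than hard-coding it; the sample lines and the printed fields are for demonstration only.

# A minimal sketch of parsing the W3C Extended format shown above.
# Field positions follow the "#Fields:" directive; real log files may
# declare a different field order, so the parser reads it dynamically.

def parse_w3c_log(lines):
    """Yield one dict per request, keyed by the names in the #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line.split()[1:]          # e.g. ['time', 'c-ip', 'cs-method', ...]
        elif line and not line.startswith("#"):
            values = line.split()
            yield dict(zip(fields, values))

sample = [
    "#Fields: time c-ip cs-method cs-uri-stem sc-status cs-version",
    "17:42:15 172.16.255.255 GET /default.htm 200 HTTP/1.0",
]

for entry in parse_w3c_log(sample):
    print(entry["c-ip"], entry["cs-method"], entry["cs-uri-stem"], entry["sc-status"])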


Web Log Analysis Tools Overview

Because web server log files cannot easily be interpreted by hand, an analyzer tool is needed to analyze the access requests and report statistics. Many web log analysis tools are available in the market to analyze log file contents and report statistics graphically. The choice of tool depends on the type of information required and on budget constraints.

Some of the criteria for selecting a log analysis tool are:

- What do you need to measure?
- Is the interface easy to use?
- How fast are reports produced?
- Are reports and terminology in plain English?
- What is the maximum log file size that can be analyzed?
- How accurate and reliable are the reports?
- How often do you require log analysis reports?
- Is it possible to access the tool via the web?
- Is it possible to open the reports without any thick client?
- Is it possible to customize the reports?
- Can filters be applied to specific reports?
- Does the tool provide online help? Is the documentation easy to understand?
- Does the tool support a variety of web servers?
- What do the users of the tool say about it? Is any comparative rating available that endorses the tool?
- Is it possible to export the reports to MS Word, Excel or PDF format?


Web Log Analysis Metrics

Web server log file analysis provides a wealth of information about visitor behavior, which helps in identifying the exact load on the system during different time periods. Historical traffic trends and deviations from expected traffic can be identified and compared, which supports business decisions. People with varying backgrounds, such as business owners, domain managers, site administrators, performance engineers and capacity planners, use log analyzer tools to determine server load and analyze user navigation trends.

The following are some of the metrics that can be derived from a web log analysis tool.


Number of Visitors

It represents the count of visitors accessing the site at any point of time. A visitor can make multiple visits. Visitors are identified uniquely by their IP address. By default, a visitor session is terminated when the user is inactive for more than 30 minutes, so a unique visitor (with a unique IP address) may visit the web site twice and be reported as two visits. The tool should provide information about both visits and unique visitors at any point of time.
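The 30-minute inactivity rule can be approximated with a short script. The following Python sketch assumes the log entries have already been parsed into (IP address, timestamp) pairs sorted by time; the sample data is illustrative only.

# A rough sketch of counting visits vs. unique visitors using the
# 30-minute inactivity rule described above. Assumes each log entry has
# already been parsed into an (ip, timestamp) pair.

from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def count_visits(entries):
    """entries: iterable of (ip, datetime) sorted by time. Returns (visits, unique_visitors)."""
    last_seen = {}          # ip -> timestamp of the previous request
    visits = 0
    for ip, ts in entries:
        prev = last_seen.get(ip)
        if prev is None or ts - prev > SESSION_TIMEOUT:
            visits += 1     # a new visit starts after 30 minutes of inactivity
        last_seen[ip] = ts
    return visits, len(last_seen)

entries = [
    ("10.0.0.1", datetime(2005, 12, 11, 9, 0)),
    ("10.0.0.1", datetime(2005, 12, 11, 9, 10)),   # same visit
    ("10.0.0.1", datetime(2005, 12, 11, 10, 0)),   # 50-minute gap -> second visit
]
print(count_visits(entries))   # (2, 1): two visits, one unique visitor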


Number of Hits

It represents the count of requests (hits) for any resource (for example, an image or an HTML page) on a web server at any point of time. Hits are not the same as the number of pages accessed. For example, if a web page contains 5 images, a visit to that page generates 6 hits on the web server: one hit for the page itself and 5 hits for the image files on the page. Identifying the peak hits per second during the rush hour gives a realistic picture of the peak traffic, but most tools do not provide the drill-down capability to find the peak hits-per-second figure.
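Where the tool does not offer this drill-down, the peak hits per second can be derived directly from the parsed timestamps. The following Python sketch buckets every hit to the second and reports the busiest second; the sample timestamps are illustrative.

# A sketch of the "peak hits per second" drill-down that many tools omit:
# bucket every request timestamp to the second and take the largest bucket.

from collections import Counter
from datetime import datetime

def peak_hits_per_second(timestamps):
    """timestamps: iterable of datetime objects, one per hit."""
    per_second = Counter(ts.replace(microsecond=0) for ts in timestamps)
    return max(per_second.items(), key=lambda kv: kv[1])   # (second, hit count)

hits = [
    datetime(2005, 12, 11, 17, 42, 15),
    datetime(2005, 12, 11, 17, 42, 15),
    datetime(2005, 12, 11, 17, 42, 16),
]
print(peak_hits_per_second(hits))   # (2005-12-11 17:42:15, 2)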


Number of Page Views

It represents the count of requests for web pages (for example, .jsp or .html files) on a web server at any point of time. Page views are equal to the number of pages accessed: if a visitor views 3 pages on a web site, 3 page views are generated on the web server. Each page view might consist of multiple hits, since the hits include the image files on the HTML page.


Authenticated Users Report

It represents the list of users whose user names were authenticated, as required by the web site. Information such as the number of hits generated by the authenticated users and the region or country from which they accessed the web site is also provided.


Peak Traffic Day → Peak Traffic Hour

It represents the daily, weekly and monthly server load statistics, with drill-down capabilities to find the peak traffic day and peak traffic hour in the entire period under analysis. These are identified by referring to the number of visitors accessing the site and the hits generated during the access period. Identifying the peak traffic day and peak traffic hour helps to establish the maximum traffic handled by the web site.
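The same drill-down can be computed directly from the parsed timestamps. The following Python sketch, using illustrative sample data, groups hits by calendar day to find the peak day and then by hour within that day to find the peak hour.

# A sketch of the peak-day / peak-hour drill-down: group hits by calendar
# day first, then by hour within the busiest day.

from collections import Counter
from datetime import datetime

def peak_day_and_hour(timestamps):
    by_day = Counter(ts.date() for ts in timestamps)
    peak_day, _ = max(by_day.items(), key=lambda kv: kv[1])
    by_hour = Counter(ts.hour for ts in timestamps if ts.date() == peak_day)
    peak_hour, _ = max(by_hour.items(), key=lambda kv: kv[1])
    return peak_day, peak_hour

hits = [
    datetime(2005, 12, 11, 9, 5),
    datetime(2005, 12, 11, 9, 40),
    datetime(2005, 12, 11, 14, 0),
    datetime(2005, 12, 12, 9, 0),
]
print(peak_day_and_hour(hits))   # (2005-12-11, 9)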


Throughput / Bytes Transferred

It represents the total number of bytes transferred by the server to the client(s) at any point of time. It is an important metric that portrays the server's performance at any point of time. The peak throughput metric shows the maximum amount of data the server transferred to fulfil client requests during the peak hours.


Files Downloaded Report

It represents the popularity list of all files, including web pages, images, media files, etc., downloaded by the users. Files are ranked by the number of times they were requested by visitors (number of hits). This report includes files with all extensions.


Top Downloads Report

It represents the popular files downloaded from the web site with extensions such as .zip, .exe and .tar. It does not include image files, HTML pages, etc. It also reports the bytes transferred, i.e. how many total bytes of data the web server transferred to visitors for each downloaded file.


HTTP Errors

It represents the HTTP error responses sent by the server during the access period. A summary of HTTP error codes and their times of occurrence is a useful metric for understanding server behavior. The typical errors reported are page-not-found errors, incomplete-download errors, server errors, etc. The following are the five classes of HTTP response status codes; a simple way of summarizing them from the log is sketched after the list.

1XX (Informational) : Request received, continuing process.

2XX (Success) : The action was successfully received, understood and accepted.

3XX (Redirection) : The client must take additional action to complete the request.

4XX (Client error) : The request contains bad syntax or cannot be fulfilled.

5XX (Server error) : The server failed to fulfill an apparently valid request.
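A summary of these classes can be produced from the sc-status field of the log. The following Python sketch counts status codes by class; the sample codes are illustrative.

# A sketch of summarising sc-status values from the log into the five
# response classes listed above.

from collections import Counter

CLASSES = {
    "1": "1XX Informational",
    "2": "2XX Success",
    "3": "3XX Redirection",
    "4": "4XX Client error",
    "5": "5XX Server error",
}

def summarise_status_codes(codes):
    """codes: iterable of status codes such as 200, 404, 500 (int or str)."""
    return Counter(CLASSES[str(code)[0]] for code in codes)

print(summarise_status_codes([200, 200, 404, 500, 302]))
# Counter({'2XX Success': 2, '4XX Client error': 1, '5XX Server error': 1, '3XX Redirection': 1})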


Most significant Entry & Exit Pages

It represents the top entry and exit pages through which users enter and leave the web site. Some web sites provide more than one way to log in (for example, through the registered user's login page, via a Google search, or via the download page) or to log out. The predominantly used option can be identified with this metric.


Browsers

It represents information about the web browsers used by the users to connect to the server. The set of browsers used, the user agents and the percentage usage of each are provided.


Platforms

It represents information about the operating systems used by the users to connect to the server. The set of platforms used and the usage percentage of each are provided.


Geographical Regions → Top Countries → Top Cities

It represents the top geographical regions, countries and cities from which users accessed the server. It provides the IP details, bytes transferred, hits generated, etc., broken down by region, country and city.


Visitor Path Analysis

The visitor arrival rate can be used to identify the statistical distribution pattern (self-similar, Poisson, exponential, etc.) of the application's traffic. This metric is usually not provided by the log analysis tool; it is a derived metric that the performance tester needs to calculate from the mean request arrival rate and the peak server load details. A rough check is sketched below.
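One simple way to perform this check, sketched below in Python, is the index of dispersion: for a Poisson arrival process, the variance-to-mean ratio of request counts per fixed interval is close to 1, whereas a much larger ratio suggests bursty or self-similar traffic. The interval counts and the threshold used here are purely illustrative.

# A rough sketch of checking whether per-interval arrival counts look
# Poisson-like: for a Poisson process the variance-to-mean ratio (index of
# dispersion) of counts per fixed interval is close to 1; a much larger
# ratio indicates bursty traffic. The 1.5 threshold is illustrative only.

from statistics import mean, pvariance

def dispersion_index(arrivals_per_interval):
    """arrivals_per_interval: list of request counts per fixed time interval."""
    m = mean(arrivals_per_interval)
    return pvariance(arrivals_per_interval) / m if m else float("nan")

counts = [12, 9, 11, 10, 13, 8, 12, 10]     # e.g. requests per minute
d = dispersion_index(counts)
print(f"index of dispersion = {d:.2f}")
print("roughly Poisson-like" if d < 1.5 else "bursty / heavy-tailed arrival pattern")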


Visitor average session length

It represents the average visitor session duration in minutes. The session lengths of visitor groups are provided in sorted order.


Search Engines and Robots

Search engines are information retrieval systems that allow users to find specific web sites based on keyword searches. Popular search engines include Google, Yahoo, MSN and Ask.com.

A web crawler (also known as a web spider or web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. It is, in effect, an automated web browser that follows every link it sees.

The log analyzer provides the list of search engine names and the key phrases that referred visitors to the web site. The Top Search Engines Report lists the search engines used by visitors to find the web site, ranked by the number of referrals from each engine. The Referring Sites Report shows the referrer web sites that drive visitors to your site, ranked by the number of hits received from each referrer. The Spider History Report shows the day-by-day history of search spider visits.


Overview of Statistical distributions

The performance test engineer needs to identify the underlying statistical distribution of the application's user-request arrival pattern. This helps to extrapolate the server load (requests handled per unit time) to high user loads. For example, if two applications (A1 and A2) have a similar target load objective (say 1,000 users) but different statistical distributions (say A1 is Poisson and A2 is self-similar), then A1 and A2 need to be performance tested at different maximum server loads in order to certify them for 99.9% availability. The maximum load point differs with the type of distribution, and it is vital to identify the system performance at that maximum load point. Statistical distribution pattern analysis therefore helps in deciding the appropriate load requirement during performance tests.

Probability distributions are a fundamental concept in statistics. Discrete probability functions are referred to as probability mass functions, and continuous probability functions are referred to as probability density functions. The term probability function covers both discrete and continuous distributions; when referring to probability functions in generic terms, the term probability density function (PDF) is sometimes used for both.
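As a minimal worked example of a discrete probability mass function, the following Python sketch uses a Poisson arrival assumption with an illustrative mean rate of 10 requests per second and finds the per-second load level that is exceeded in only 0.1% of seconds.

# A minimal illustration of a probability mass function: if requests arrive
# as a Poisson process with a mean rate of 10 requests/second, the PMF gives
# the probability of seeing exactly k requests in one second, and the tail
# sum gives the load level exceeded only 0.1% of the time.
# The rate used here is purely illustrative.

from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

lam = 10.0                       # mean arrival rate (requests per second)
cumulative, k = 0.0, 0
while cumulative < 0.999:        # find the 99.9th percentile of per-second load
    cumulative += poisson_pmf(k, lam)
    k += 1
print(f"99.9% of seconds see fewer than {k} requests")   # prints 22 for lambda = 10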

