In-depth: The IT Performance Concept
There is more to performance than just speed.
February 14, 2022
If you search the Internet for “performance” or even “performance theory”, most of the hits, on examination, are concerned with theater, music or employee business performance. Quoting from dictionaries can therefore be misleading in this particular subject: many results which look like IT topics are actually about personal performance in work or the arts.
Performance in an IT sense has connotations of speed, or quickness of response to a query or a request for some work to be processed. However, there is more to performance than raw speed, because that quickness will vary about some mean value. Thus, a specification of a desired performance criterion might be “the responses must be in the range 1.5 to 2.5 seconds when measured at the user workstation.”
This might be difficult to guarantee, so other forms of words are used, such as “90% of the responses must be 2.5 seconds or less.” This is called a percentile definition. In online work, a mean response time needs another desirable characteristic – that of consistency. At first glance, a mean response time of 2.3 seconds has to be better than one of 2.5 seconds, n'est-ce pas? Not so, I'm afraid.
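As an illustration, a percentile target like this can be checked directly against a batch of measured response times. A minimal Python sketch – the sample figures are hypothetical, not taken from any real measurement:

```python
# Illustrative check of a percentile-style SLA target:
# "90% of the responses must be 2.5 seconds or less."
response_times = [1.8, 2.1, 2.4, 2.0, 3.1, 2.2, 1.9, 2.5, 2.3, 2.6]  # hypothetical samples (seconds)

target_seconds = 2.5
required_fraction = 0.90

within_target = sum(1 for t in response_times if t <= target_seconds)
fraction = within_target / len(response_times)

print(f"{fraction:.0%} of responses were {target_seconds}s or less")
print("SLA met" if fraction >= required_fraction else "SLA missed")
# prints: 80% of responses were 2.5s or less / SLA missed
```

Here only 8 of 10 samples meet the 2.5-second target, so the 90th-percentile requirement is missed even though most responses are comfortably fast.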
The Paradox: Imagine the 2.3 second mean is made up of widely varying responses – say 0.1, 7.1, 3.4, 0.4, 11 and so on, averaging out at 2.3. The user will not be pleased with this since a) it is annoying and b) it disrupts the fluency of his or her work and hence its effectiveness. If the 2.5 second mean is composed of 2.0, 2.4, 2.7, 2.5 and so on, then the user will be much happier because of the consistency of the response time of a transaction. Q.E.D.
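The paradox can be put in numbers. A short Python sketch using the sample figures quoted above (note the first series is truncated in the text – “and so on” – so its mean here will not be exactly the 2.3 of the full series):

```python
from statistics import mean, pstdev

# The two illustrative response-time streams from the text (seconds).
# The first list is only the start of the quoted series, so its mean
# differs from the 2.3 of the complete (unstated) series.
erratic    = [0.1, 7.1, 3.4, 0.4, 11.0]
consistent = [2.0, 2.4, 2.7, 2.5]      # mean 2.4 - close to 2.5, and steady

for name, times in (("erratic", erratic), ("consistent", consistent)):
    print(f"{name:<10} mean={mean(times):.2f}s  std dev={pstdev(times):.2f}s")
```

The erratic stream has a standard deviation more than ten times that of the consistent one, and it is that spread – not the mean – which the user experiences as poor performance.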
The other form of work is the batch job where the elapsed time to complete it is the main criterion and is the batch equivalent of the online response time.
What can we do about performance?
This depends on whether we are talking about pre-operational or post-operational periods and activities. There are things which can be done (or not) in both phases of a performance project to ensure its success. It is well known that correcting errors in the operational phase is far more expensive than correcting them in the planning and design stages, and this should be borne in mind throughout the exercise.
Pre-operational options
Ignore performance altogether (not recommended);
Establish the need for it and quantify the requirement for the workload in question. One “size” does not fit all;
Design for it in relevant areas (servers, storage, network) to meet requirements, normally specified in a service level agreement (SLA) and organized by business priority;
Predict it analytically, by simulation or by rule of thumb (ROT). ROTs are usually based on received wisdom or an organization's own experience;
Benchmark it or relate it to standard benchmarks;
Document it (recommended) and amend in the light of experience.
Note: If the pre-operational aspects of performance planning are skipped or relegated to “couldn't care less” priority, it is almost certain that there will be problems and any solution will almost certainly be in the sticking plaster class, which is an implementation of the “make do and mend” architecture.
We shall see very shortly that cutting corners at the design stage, for financial reasons or an easy life, may cost more to correct later than to get it right first time (Figure 4).
Today, this may not be so onerous a task, since over-configuring is not as expensive as it used to be – but it is less professional and more wasteful.
Figure 1: Pre-installation performance activity
Note: The figure is not all-encompassing but a high-level view of activities, and within these there will be sub-activities. For example, the Application & System Design panel might include a series of activities for program development: feasibility, design, code, walkthrough, test and so on.
Post-operational options
Ignore performance altogether (not recommended);
Measure (monitor) it and plot it (makes a pretty picture but useless unless you can interpret it sensibly);
Measure it, plot it and make a record for posterity (noble, but useless);
Manage it (measure, record, analyze, act and plan). The recommended route.
A key performance concept in this phase is performance management, which we touch on in several areas throughout the book referenced at the end of this article. It has several parts to it as expressed by the pseudo-equation:
Management = Design + Measurement + Analysis + Action
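The pseudo-equation can be read as a repeating cycle. A minimal sketch of that cycle in Python – `measure`, `analyze` and `act` are hypothetical placeholders standing in for real monitoring tooling, not any actual API:

```python
# A sketch of the management pseudo-equation as a cycle:
# measure -> record -> analyze -> act (and feed results back into planning).

def measure():
    """Collect current response times from monitors (stubbed sample data)."""
    return [2.0, 2.4, 2.7, 2.5]

def analyze(samples, target=2.5):
    """Compare measurements against the SLA objective; return the breaches."""
    return [t for t in samples if t > target]

def act(breaches):
    """Decide on corrective action; here we simply report."""
    if breaches:
        print(f"{len(breaches)} response(s) exceeded target - investigate")
    else:
        print("Within objective - no action needed")

history = []                 # record for posterity AND for capacity planning
samples = measure()          # Measurement
history.append(samples)      # Record
act(analyze(samples))        # Analysis + Action
```

The point of the sketch is the loop, not the stubs: measurement alone (the “pretty picture”) only becomes management once the analyze and act steps are wired in.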
For measurement, we can use the frequently employed term monitoring which itself is made up of several parts as shown in the next diagram.
Figure 2: Post-installation performance activity
The times comprising a response time are as follows:
TR is the total response time (user to system to user) for a unit of work which in most cases will be a transaction, a query or a batch job
TN is the network transit time both ways, in and out
TS is the CPU time in the server (processor). (This is a tricky area and depends on what a monitor measures and reports as “server (processor) time”. If only CPU time is reported, then disk time needs to be added separately; otherwise disk time is part of the time spent in the processor and Total Response Time TR = TS + TN.)
TD is the storage access time
So, in the simple case where the server, disk and network represent our system, the following relationship holds:
Total Response Time TR = TS + TN + TD
However, the three times which make up TR are themselves composed of other times, some of which are shown schematically in the figure as tX where X represents the various subsystems which make up the server, network and storage resources. (The ones shown are illustrative and not meant to be an exhaustive list of subcomponents.) Thus, as an example,
TN = tL + tR + tF + similar terms for other network elements
Where subscript L stands for the link (wires/cable/fiber), R for router and F for firewall in our simple example.
There will be similar relationships for TS and TD involving the service times, t, of their subcomponents. If there are other resources in the overall system which contribute to the service (and hence response) times, they would be included.
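The hierarchy above can be modelled directly. A small Python sketch – the subcomponent times (the tX terms) are hypothetical sample values, and the subcomponents chosen are illustrative, as in the article:

```python
# Illustrative model of the response-time hierarchy.
# All times are hypothetical sample values in seconds.

network = {"link": 0.30, "router": 0.05, "firewall": 0.10}  # tL, tR, tF
server  = {"cpu": 0.80, "memory": 0.15}                     # example TS terms
storage = {"disk": 0.60}                                    # example TD terms

TN = sum(network.values())   # network transit time, both ways
TS = sum(server.values())    # time spent in the server
TD = sum(storage.values())   # storage access time

TR = TS + TN + TD            # total response time
print(f"TN={TN:.2f}s  TS={TS:.2f}s  TD={TD:.2f}s  TR={TR:.2f}s")
```

Each top-level term is just the sum of its subcomponent service times, so adding another resource (a load balancer, say) means adding one more tX entry to the appropriate dictionary.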
Very interesting, I hear you say, but so what? In books and numerous papers you will see the term “visibility”, which in essence means: how well do you know the route between the user and the system and back?
Such intimate knowledge of the landscape means using monitors to see what is going on at various points which might be bottleneck points for access to various IT resources.
Figure 3: Hierarchy of total response time contributions
These operational activities should tie in with change management since uncontrolled changes are often a cause of extra demands on resources which must be visible. Unauthorized, but nominally valid, user access can usually be detected via consistently higher resource consumption. I know; I've seen it happen.
Why? The more proactive of these activities are necessary to balance the demands of service level agreements (SLAs) against feasibility and costs.
Figure 4: Cost and change as a project progresses
To take a lightning look at SLA requirements, we might note that three key elements in modern SLAs should be performance, availability and security. Each will have a diagram like the two above (Figures 1 and 2), but covering the availability and security aspects of design, monitoring, analysis and corrective actions.
Figure 4 above illustrates schematically the costs involved in making changes to aid performance at various stages of choice and development. For example, finding that the chosen architecture is incapable of delivering the required performance involves great expense, followed by the large expense of changing an application, either by rewriting it or by moving to a new COTS (Commercial Off The Shelf) application. Later changes, such as to the operating system, compiler and other system parameters, involve less effort and cost.
The moral of this story and diagram is to ‘get it right first time’ or incur large expenses in later corrective actions – the later the correction, the greater the cost, depending on how far you need to backtrack. Remember, a stitch in time saves nine. Figure 4 means that if, as a result of poor design and/or planning, you have to retrace your steps through OS software, hardware, application and finally architecture, you are faced with ever-increasing, possibly exponential, costs of repair.
Summary
The key points about performance to take away are:
If you can't measure it, you can't manage it; if you can't manage it, you are probably in trouble;
Response times matter to the user, especially the user of a public site for ordering goods and services. Studies have shown that a user's patience with web site response is of the order of a very few seconds, after which they abandon the site and possibly shop elsewhere;
Performance is judged against a baseline, which is the objective time. If you don't have an objective, you can't have a response time problem;
Consistent response times are important for productivity and marketing reasons;
Remember that management is not just monitoring (measurement);
Performance Management = Design + Measure/Monitor + Analysis + Action.
Dr. Terry Critchley is a retired IT consultant and author who’s worked for IBM, Oracle and Sun Microsystems with a short stint at a big UK Bank.
This blog is created from extracts from his book: High Performance IT Services.