Mainframe production – Keeping users happy is a full-time job
Mainframes are critical for many businesses and need to be available at all times. But keeping the mainframe accessible is not enough. Users expect good response times. They need answers to any specific issues they encounter. Their business activities rely on mainframe operations running smoothly. The mainframe production team focuses on compliance with the Service Level Agreements (SLAs). To achieve their goal, they need to monitor several key performance indicators.
Question #1 – What are we producing?
Mainframe production is made up of two parts: transaction processing (TP) and batch processing. Generally, the former runs during the day, while the latter runs at night. The first question the production team must answer is how much activity each category generates.
Online activity is measured by the number of transactions executed over a given period. Count all transactions, regardless of the transaction manager used (CICS or IMS/DC).
This volume can also be expressed per second to reflect the activity throughput.
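As a quick illustration, converting a transaction count into throughput is a simple division. The figures and the function name below are hypothetical and only serve to show the arithmetic.

```python
def transactions_per_second(transaction_count: int, period_seconds: float) -> float:
    """Average throughput over a measurement period."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return transaction_count / period_seconds


# Hypothetical example: 5.4 million transactions over a 10-hour online window.
tps = transactions_per_second(5_400_000, 10 * 3600)
print(f"{tps:.0f} transactions per second")
```

An average like this hides peaks, so the same calculation over shorter intervals (per hour, per minute) usually gives a more useful picture.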
For the batch activity, you should look at the number of batch jobs executed within a given period. Batch jobs are scheduled tasks and should not interfere with online activity.
Keep an eye on the volume of batch jobs executed during the TP window, since these jobs compete with online activity for resources. Conversely, an unusually low number of jobs may be a sign of an incident.
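One possible way to spot batch jobs that ran during the TP window is to test each job's execution interval against the window. This is a minimal sketch: the window hours, the job names and the record layout are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical online (TP) window; adjust to your own service hours.
TP_START = datetime(2024, 3, 4, 8, 0)
TP_END = datetime(2024, 3, 4, 18, 0)

# Hypothetical job records: (job name, start time, end time).
jobs = [
    ("NIGHTLY1", datetime(2024, 3, 4, 1, 0), datetime(2024, 3, 4, 2, 30)),
    ("EXTRACT7", datetime(2024, 3, 4, 7, 30), datetime(2024, 3, 4, 8, 45)),
    ("REPORT02", datetime(2024, 3, 4, 12, 0), datetime(2024, 3, 4, 12, 20)),
    ("BACKUP01", datetime(2024, 3, 4, 22, 0), datetime(2024, 3, 4, 23, 0)),
]


def overlaps_tp_window(start: datetime, end: datetime) -> bool:
    """True if the job ran at any moment inside the TP window."""
    return start < TP_END and end > TP_START


jobs_in_tp_window = [
    name for name, start, end in jobs if overlaps_tp_window(start, end)
]
print(jobs_in_tp_window)
```

The overlap test also catches jobs that started before the window opened but were still running when online activity resumed, which is often the case that matters most.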
Knowing how many transactions or batch jobs are being executed is good but not enough. Improving quality of service requires an understanding of your mainframe’s business profile:
- Which periods within the month are the busiest?
- When do I have off-peak periods where I could run batch jobs?
- What are the busiest days of the week?
Getting the answers to these questions allows you to make better management decisions.
Keep in mind that a business's profile is likely to change over time as the business evolves. That is why you should also look at historical data and compare periods with one another.
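The profile questions above can be answered with a simple aggregation over historical volumes. The sketch below averages daily transaction counts by day of the week; the dates and volumes are invented for the example.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical daily transaction counts (date -> transactions executed).
daily_transactions = {
    date(2024, 3, 4): 4_800_000,   # Monday
    date(2024, 3, 5): 5_100_000,   # Tuesday
    date(2024, 3, 9): 900_000,     # Saturday
    date(2024, 3, 11): 5_200_000,  # Monday
    date(2024, 3, 12): 5_300_000,  # Tuesday
    date(2024, 3, 16): 1_100_000,  # Saturday
}


def average_by_weekday(counts):
    """Average transaction volume for each day of the week present in the data."""
    volumes = defaultdict(list)
    for day, volume in counts.items():
        volumes[WEEKDAYS[day.weekday()]].append(volume)
    return {weekday: sum(v) / len(v) for weekday, v in volumes.items()}


profile = average_by_weekday(daily_transactions)
busiest = max(profile, key=profile.get)
quietest = min(profile, key=profile.get)
print(f"Busiest: {busiest}, quietest: {quietest}")
```

Running the same aggregation on different months and comparing the results is one way to see how the profile drifts over time.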
Question #2 – With which resources?
Every task executed on the mainframe consumes resources, in varying proportions. At a high level, you should focus on monitoring CPU consumption and storage usage.
Processors are one of the most valuable resources for mainframe production. Make sure their computing power is used on the right activity. The best way to do this is to monitor total CPU time. This is also a great way to detect any overconsumption that could be the result of a loop in a program.
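A simple way to flag possible overconsumption is to compare the CPU time of the latest run of a job with its usual level. The sketch below uses the median of previous runs as a baseline; the job names, the figures and the 3x threshold are assumptions, not recommended values.

```python
from statistics import median

# Hypothetical CPU seconds consumed by the same job over previous runs.
history = {
    "PAYROLL1": [118.0, 121.5, 119.2, 122.8, 120.1],
    "INVOICE3": [45.3, 44.8, 46.1, 45.0, 45.6],
}

# Hypothetical CPU seconds for the latest run of each job.
latest = {"PAYROLL1": 123.4, "INVOICE3": 310.7}

# Assumed alert threshold: flag a run using more than 3x its usual CPU time.
THRESHOLD = 3.0


def overconsuming_jobs(history, latest, threshold=THRESHOLD):
    """Return jobs whose latest CPU time exceeds threshold x their median,
    mapped to the ratio between the latest run and the baseline."""
    flagged = {}
    for job, cpu_seconds in latest.items():
        baseline = median(history[job])
        if cpu_seconds > threshold * baseline:
            flagged[job] = round(cpu_seconds / baseline, 1)
    return flagged


print(overconsuming_jobs(history, latest))
```

A flagged job is not necessarily looping; a legitimate increase in input volume can produce the same signal, so the result is a prompt for investigation rather than a diagnosis.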
On the storage side, the priority is to determine how much free space is available. You may then want to keep an eye on the activity density within particular critical storage groups. If the activity is not evenly distributed, bottlenecks may occur. To learn more about how to analyze storage activity on the mainframe, read our other blog post “Leverage Storage Analytics To Improve Mainframe Performance”.
Question #3 – What is the quality?
The first indicator of quality is whether services were available to users. This means OLTP servers need to be available during business hours. Successful completion of batch jobs is also a requirement for business continuity. This is where following critical paths helps, so that TP is ready to resume once batch processing ends.
The second indicator of quality is the response times of the various applications. No one likes to wait minutes in front of a screen for answers or to get a job done. Maintaining good response times is a key factor in customer satisfaction. Average response times of the online activity are a great starting point. This offers an overview and helps detect deviations. For greater insights, you should monitor response times at the business application level. By doing so, you know exactly who is experiencing poor response times.
Analyzing WLM is also a great help in understanding the quality of the service provided. This z/OS component provides the performance index for each service class. This index tells you whether the performance goals you set are being met. If the index is greater than 1, the quality of service is lower than expected.
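To make the index concrete, here is a minimal sketch based on the commonly documented definitions: for a response-time goal, the achieved time divided by the goal; for an execution-velocity goal, the goal velocity divided by the achieved velocity. WLM computes these values itself; the service class names and figures below are hypothetical, and percentile response-time goals are left out for simplicity.

```python
def pi_response_time(actual_seconds: float, goal_seconds: float) -> float:
    """Performance index for an average response-time goal: actual / goal."""
    return actual_seconds / goal_seconds


def pi_velocity(actual_velocity: float, goal_velocity: float) -> float:
    """Performance index for an execution-velocity goal: goal / actual."""
    return goal_velocity / actual_velocity


# Hypothetical service classes and measurements.
indexes = {
    "ONLINE_HI": pi_response_time(actual_seconds=0.4, goal_seconds=0.5),
    "BATCH_MED": pi_velocity(actual_velocity=25.0, goal_velocity=40.0),
}

for service_class, pi in indexes.items():
    status = "goal met" if pi <= 1 else "goal missed"
    print(f"{service_class}: PI = {pi:.2f} ({status})")
```

In both cases the ratio is built so that a value above 1 means the service class is doing worse than its goal.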
Finally, one aspect of the SLAs is the Mean Time To Repair (MTTR). It is a good idea to keep track of these times. Lowering MTTR is easier if you have access to the right data for investigation.
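MTTR is the total repair time divided by the number of incidents. The sketch below assumes each incident is recorded as a pair of timestamps (detection and restoration); the incident data is invented for the example.

```python
from datetime import datetime, timedelta

# Hypothetical incidents: (time detected, time service was restored).
incidents = [
    (datetime(2024, 3, 4, 9, 15), datetime(2024, 3, 4, 9, 45)),
    (datetime(2024, 3, 12, 14, 0), datetime(2024, 3, 12, 15, 30)),
    (datetime(2024, 3, 20, 23, 10), datetime(2024, 3, 21, 0, 10)),
]


def mean_time_to_repair(incidents) -> timedelta:
    """Average elapsed time between detection and restoration."""
    if not incidents:
        raise ValueError("no incidents recorded")
    total = sum(
        (restored - detected for detected, restored in incidents),
        timedelta(),
    )
    return total / len(incidents)


mttr = mean_time_to_repair(incidents)
print(f"MTTR: {mttr}")
```

Tracking this value per month, or per application, shows whether repair times are actually improving.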
Question #4 – At what cost?
The mainframe production team has the responsibility to keep users happy but not at any cost. You may even have incentives to remain within a certain budget or meet a certain cost-reduction target. This is where cost-related indicators come into play.
On the mainframe, software accounts for almost 50% of costs. A lot of IBM software is invoiced monthly based on peak Million Service Units (MSUs) consumed. In this case, keeping an eye on MSU consumption is key to controlling mainframe costs.
This consumption peak is calculated for each LPAR, so you may want to track MSU levels at the LPAR level. For cost-reduction projects, however, you need to know exactly who is contributing to the billing peak.
Of course, costs are not limited to software. Hardware investments, human resources and services also account for a fair share of your budget. Tracking costs against budget for each expenditure item is a useful addition to cost control.
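As an illustration of a per-LPAR peak, the sketch below assumes the billing peak is taken from a rolling four-hour average of MSU consumption, which is the usual basis for IBM sub-capacity pricing. It is deliberately simplified: it uses hourly samples, the LPAR names and figures are hypothetical, and it does not reproduce any official reporting tool.

```python
# Hypothetical hourly MSU samples per LPAR.
msu_samples = {
    "PROD1": [120, 180, 260, 300, 280, 150],
    "TEST1": [40, 40, 60, 30, 20, 20],
}

# Assumption: four hourly samples make up one rolling window.
WINDOW = 4


def rolling_peak(samples, window=WINDOW):
    """Highest average over any run of `window` consecutive samples."""
    if len(samples) < window:
        raise ValueError("not enough samples for one full window")
    averages = [
        sum(samples[i:i + window]) / window
        for i in range(len(samples) - window + 1)
    ]
    return max(averages)


peaks = {lpar: rolling_peak(samples) for lpar, samples in msu_samples.items()}
for lpar, peak in peaks.items():
    print(f"{lpar}: peak rolling average = {peak} MSU")
```

Because the peak is an average over a window, a short spike matters less than a sustained plateau, which is why shifting workload away from the busiest hours can lower the bill.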
Summary – Analyzing mainframe production daily
In most organizations, the working day starts with a production meeting. The aim is to analyze activity over the last 24 hours. This is what an operational dashboard might look like.
Online activity (TP):
- Number of transactions executed
- Average response times
- Number of abends with code details
Batch activity:
- % of jobs completed before the end of the batch window
- Number of jobs executed
- Number of abends with code details
Resources:
- Global CPU consumption
- Distribution between TP and batch
- Storage availability: free space, I/O density in targeted storage groups
Costs:
- MSU consumption broken down by LPAR
- Billing peak in the current month