2014 April 2 | 07:26 pm

This page covers the scenarios that are driving the Availability work that the Automation WG is undertaking.

Introduction

The goal of this workgroup is to describe automation scenarios in the context of availability. The scenarios range from simple operational tasks that are carried out automatically on a day-to-day basis, through high availability scenarios where automated failover ensures that the business recovers quickly or even stays up 24x7, to disaster recovery scenarios where automated data replication and site switch procedures help to achieve a continuous or near-continuous availability solution for the business.

With availability, the concept of an Availability Resource is introduced. An Availability Resource carries several kinds of status information. Most importantly, it distinguishes:

  • observed status - in what status is the Availability Resource right now?
  • desired status - in what status should the Availability Resource be?
  • compound status - do the observed and desired status match?
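
The relationship between the three statuses can be illustrated with a minimal sketch. The class name, the two status values, and the `compound_ok` property are assumptions made for illustration only; this page does not define a concrete data model.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical status values; a real provider may define more.
    ONLINE = "online"
    OFFLINE = "offline"


@dataclass
class AvailabilityResource:
    name: str
    observed: Status  # the status the resource is in right now
    desired: Status   # the status the resource should be in

    @property
    def compound_ok(self) -> bool:
        """Compound status: True when observed and desired status match."""
        return self.observed == self.desired


db = AvailabilityResource("db", observed=Status.OFFLINE, desired=Status.ONLINE)
print(db.compound_ok)  # False: the resource should be online but is not
```

Deriving the compound status from the other two, rather than storing it, keeps the three values from drifting apart in this sketch.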

Availability Resources can also be collections of other Availability Resources to represent redundancy in various forms, for example a primary member with multiple cold standby backup members. In another use case, an Availability Resource represents a replication group with a primary data source and a secondary copy of it. Replication can be done synchronously or asynchronously.

Finally, Availability Resources have a history. Looking at the history, it is possible to forecast their future behaviour, for example by projecting the expected planned downtime of an Availability Resource or by computing a Mean-Time-To-Repair (MTTR) value.
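
As a rough illustration of the MTTR computation, the sketch below assumes that the history is available as a list of outage intervals, each a (start, end) pair of timestamps. That representation and the function name are assumptions, not something this page specifies.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(outages):
    """Average duration of the recorded outages, or None without history."""
    if not outages:
        return None
    total = sum((end - start for start, end in outages), timedelta())
    return total / len(outages)


# Hypothetical history: one 30-minute and one 90-minute outage.
history = [
    (datetime(2014, 3, 1, 10, 0), datetime(2014, 3, 1, 10, 30)),
    (datetime(2014, 3, 15, 2, 0), datetime(2014, 3, 15, 3, 30)),
]
mttr = mean_time_to_repair(history)
print(mttr)  # 1:00:00
```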

These attributes bring with them a list of advanced automation scenarios. Taken together, the scenarios seem to justify introducing Availability Resources as a concept with its own scenarios and specification, built on top of Automation.

Scenarios

Provide service to list workloads

Workloads are single entities or groups of entities executed on a server to deliver a particular business value. Examples are started tasks on a z/OS system, a middleware subsystem consisting of several processes or address spaces, or even multi-tiered business applications that can span multiple servers. This scenario is about listing all or selected workloads to retrieve status information or further workload-specific details.
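
A listing service of this kind might look like the following sketch. The `Workload` fields, the filter parameters, and the sample names are hypothetical and serve only to show "all or selected" retrieval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    kind: str      # e.g. "started task", "subsystem", "business application"
    observed: str  # observed status, e.g. "online" or "offline"


def list_workloads(workloads, kind=None, observed=None):
    """Return all workloads, or only those matching the given filters."""
    return [
        w for w in workloads
        if (kind is None or w.kind == kind)
        and (observed is None or w.observed == observed)
    ]


inventory = [
    Workload("PAYROLL", "started task", "online"),
    Workload("MQ01", "subsystem", "offline"),
    Workload("webshop", "business application", "online"),
]
online = list_workloads(inventory, observed="online")
print([w.name for w in online])  # ['PAYROLL', 'webshop']
```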

Provide service to start and stop workloads

This scenario builds on the scenario introduced above. Given the observed status of a workload, this scenario is about starting and stopping that workload in an automated way.
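
One way to read this scenario is that a start or stop request first checks the observed status and acts only when a change is needed. The sketch below follows that reading. The class, the status strings, and the return convention are assumptions, and the status assignment stands in for whatever start or stop procedure a real provider would run.

```python
class Workload:
    def __init__(self, name, observed="offline"):
        self.name = name
        self.observed = observed  # "online" or "offline" (hypothetical values)

    def start(self):
        """Start the workload; return True only if an action was needed."""
        if self.observed == "online":
            return False
        self.observed = "online"  # placeholder for the real start procedure
        return True

    def stop(self):
        """Stop the workload; return True only if an action was needed."""
        if self.observed == "offline":
            return False
        self.observed = "offline"  # placeholder for the real stop procedure
        return True


job = Workload("PAYROLL")
print(job.start())  # True: the workload was offline and is now online
print(job.start())  # False: already online, nothing to do
```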

Provide service to obtain redundancy information for a workload

This scenario allows a consumer to retrieve redundancy information for a workload. This means that a list of the workload's members is returned, together with the redundancy role and the status of each individual member. For example, one member can be the designated primary (= active) member, while the others are backup (= inactive) members.
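
The returned information could be shaped as one record per member, as in this sketch. The role names, the record layout, and the choice to list the primary first are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    role: str    # "primary" or "backup" (hypothetical role names)
    status: str  # observed status, e.g. "active" or "inactive"


def redundancy_info(members):
    """Return one {name, role, status} record per member, primary first."""
    # sorted() is stable, so backups keep their original relative order.
    ordered = sorted(members, key=lambda m: m.role != "primary")
    return [{"name": m.name, "role": m.role, "status": m.status} for m in ordered]


members = [
    Member("node-b", "backup", "inactive"),
    Member("node-a", "primary", "active"),
    Member("node-c", "backup", "inactive"),
]
info = redundancy_info(members)
print([record["name"] for record in info])  # ['node-a', 'node-b', 'node-c']
```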

Provide service to failover primary workload

A workload that has been configured to contain one primary member and one or more backup members is said to be highly available. During a planned or unplanned outage, when the primary member is not available, one of the backup members can take over its work. This scenario describes the steps to fail over the primary workload to one of its backup members in case of a planned outage. Note that automatic failover in case of an unplanned outage is inherent to the automation system (= service provider) that is responsible for keeping this workload highly available.
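
A planned failover could be sketched as follows: pick a backup that is able to take over, deactivate the primary, activate the backup, and swap the roles. The `available` flag, the first-match selection, and the assumption of exactly one primary are simplifications introduced here; the page does not prescribe a selection policy.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    role: str               # "primary" or "backup"
    active: bool            # True while the member is doing the work
    available: bool = True  # False if the member cannot take over


def planned_failover(members):
    """Move the primary role to the first available backup and return it."""
    primary = next(m for m in members if m.role == "primary")
    target = next(
        (m for m in members if m.role == "backup" and m.available), None
    )
    if target is None:
        raise RuntimeError("no available backup member to fail over to")
    primary.active = False  # stop work on the old primary
    target.active = True    # start work on the chosen backup
    primary.role, target.role = "backup", "primary"
    return target


members = [
    Member("node-a", "primary", active=True),
    Member("node-b", "backup", active=False, available=False),
    Member("node-c", "backup", active=False),
]
new_primary = planned_failover(members)
print(new_primary.name)  # node-c
```

Raising an error when no backup is available keeps the old primary untouched, which seems the safer default for a planned outage.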

Provide service to grow and shrink a workload

Certain workloads have rather dynamic characteristics. Under normal conditions, the resources required to handle the workload are known and limited to some degree. But at certain times, for example Black Friday or end-of-month processing, many more resources are required than usual. Conversely, such workloads might use almost no resources on a Sunday morning between 3am and 6am.

To still achieve the expected level of availability and responsiveness while optimizing resource utilization, such workloads can be grown or shrunk dynamically, based on demand. In this scenario, a workload consists of a variable number of members with a predefined maximum. When demand grows, one or more of the inactive members can be activated until the predefined maximum number of members is active. When demand shrinks, one or more of the active members can be deactivated, up to the number of currently active members, which can leave the workload with 0 active members.
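
The bounds described above (never more than the predefined maximum, never fewer than 0 active members) can be sketched as two clamped operations. The class and method names are hypothetical, and members are reduced to a simple counter for brevity.

```python
class ElasticWorkload:
    def __init__(self, max_members, active=0):
        if not 0 <= active <= max_members:
            raise ValueError("active must be between 0 and max_members")
        self.max_members = max_members
        self.active = active

    def grow(self, count=1):
        """Activate up to `count` inactive members; return how many were activated."""
        if count < 0:
            raise ValueError("count must not be negative")
        activated = min(count, self.max_members - self.active)
        self.active += activated
        return activated

    def shrink(self, count=1):
        """Deactivate up to `count` active members; return how many were deactivated."""
        if count < 0:
            raise ValueError("count must not be negative")
        deactivated = min(count, self.active)
        self.active -= deactivated
        return deactivated


shop = ElasticWorkload(max_members=4, active=1)
print(shop.grow(5))    # 3: only three inactive members were left
print(shop.shrink(9))  # 4: all active members deactivated
print(shop.active)     # 0
```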