Mastering Cloud Operations Requirements

Operations management needs to master six new capabilities to deliver on the promise of cloud.


1. Operate on the “Pools” of Compute, Storage, and Memory

Traditionally, operations management solutions have provided coverage for individual servers, storage arrays, or network devices. With the cloud, it becomes imperative to operate at the “pool” level and to look beyond what can be monitored on an individual device. Operations organizations must have immediate access to the operational status of the pool. That status can be aggregated by workload (current usage) and capacity (past usage and future projections). Perhaps more importantly, the status needs to accurately reflect the underlying health of the pool, because individual component availability is not the same as pool availability: a pool can remain healthy after one host fails, as long as the remaining capacity still covers the workload. The operations management solution you use should understand the behavior of the pool and report health status based on it.
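The difference between component availability and pool availability can be illustrated with a minimal Python sketch. The Host type, the pool_status function, and the 20 percent headroom figure are hypothetical names and values chosen for illustration; they are not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    capacity_ghz: float  # CPU capacity this host contributes to the pool
    up: bool = True      # whether the host is currently available


def pool_status(hosts, demand_ghz, headroom=0.2):
    """Report health for the pool as a whole, not for each host.

    The pool is "ok" while the surviving hosts can carry the demand
    with the requested headroom, "degraded" when they cannot, and
    "down" only when no capacity is left at all.
    """
    live = [h for h in hosts if h.up]
    capacity = sum(h.capacity_ghz for h in live)
    if capacity == 0:
        return {"health": "down", "hosts_up": 0, "utilization": None}
    utilization = demand_ghz / capacity
    health = "ok" if utilization <= 1.0 - headroom else "degraded"
    return {
        "health": health,
        "hosts_up": len(live),
        "utilization": round(utilization, 2),
    }


pool = [
    Host("host-a", 10.0),
    Host("host-b", 10.0),
    Host("host-c", 10.0, up=False),  # one failed component
    Host("host-d", 10.0),
]
status = pool_status(pool, demand_ghz=20.0)
```

In this example one host is down, yet the pool still reports "ok" because the three remaining hosts carry the demand with room to spare. A device-level view would show only a red host.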

2. Monitor Elastic Service

Elasticity is central to cloud architectures, which means that services can dynamically expand and contract based on demand. Your operations management solution must adapt to this dynamic nature. For example, when monitoring the performance of a service, monitoring coverage should expand or contract with the service automatically. This means that a manual process cannot be used to figure out and deploy monitoring capabilities to the target. Your operations management solution needs to know the configuration of that service and automatically deploy or remove the necessary agents. Another important consideration is coverage for both cloud and non-cloud resources. This is most critical for enterprises building a private cloud. Why? Chances are that not every tier of a multitier application can be moved to the cloud. There may be static, legacy pieces, such as a database or persistence layer, that are still deployed on physical servers. Services must be monitored no matter where their resources are located, in the cloud or on premises. In addition, the management solution should natively understand how behavior differs in each environment. When resources are located in both private and public clouds, your operations solution should monitor services in each seamlessly. It should also support inter-cloud service migration. At the end of the day, your operations management solution must know where each service’s resources are located and understand the behavior of services accordingly.
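One way to picture monitoring coverage that follows an elastic service is a reconciliation step that compares the instances a service currently has with the instances that carry a monitoring agent. The sketch below assumes instance names are available as plain strings; the reconcile function and the instance names are hypothetical.

```python
def reconcile(service_instances, monitored_instances):
    """Return (to_deploy, to_remove) so coverage matches the service.

    to_deploy: instances that exist but have no monitoring agent yet.
    to_remove: agents left behind on instances the service released.
    """
    desired = set(service_instances)
    current = set(monitored_instances)
    return sorted(desired - current), sorted(current - desired)


# The service scaled out to three web nodes and released "web-4".
to_deploy, to_remove = reconcile(
    ["web-1", "web-2", "web-3"],
    ["web-1", "web-4"],
)
```

Running this step whenever the service configuration changes keeps coverage in line with the service without a manual process.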

3. Detect Issues Before They Happen

Compared to workloads in the traditional data center, workloads in the cloud exhibit a wider variety of behavioral issues due to their elastic nature. When service agility is important, relying on reactive alerts or events to support stringent SLAs is not an option, particularly for service providers. You need to detect and resolve issues before they happen. But how do you do that? First and foremost, you should implement a monitoring solution that knows how to learn the behavior of your cloud infrastructure and cloud services. While this technology exists in the traditional data center, device-level behavior evolves more rapidly and with less conformity in the cloud. That’s why your solution should have the ability to learn the behavior of abstracted resources, such as pools, as well as service levels that are based on business key performance indicators (KPIs). Based on those metrics, the solution should give predictive warnings so that problems can be isolated before they affect your customers. To further pinpoint problems, operations should conduct a proper root cause analysis. This becomes even more critical in the cloud, where large numbers of scattered resources are involved. A single failure might manifest itself as a sea of red alerts suddenly appearing in a monitoring dashboard. Even though one of them may be a critical network alert, chances are you are not going to notice it. Your operations management solution should intelligently detect the root cause of an issue with the cloud infrastructure and highlight that network event in your dashboard, while also invoking your remediation process.
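As a rough illustration of behavior learning, the following Python sketch builds a baseline from past samples of a metric and flags values that deviate from it. It assumes a simple mean and standard deviation model with a hypothetical sensitivity factor k; commercial solutions typically use more sophisticated analytics, so this is a simplification rather than a description of any product.

```python
from statistics import mean, stdev


def learn_baseline(samples):
    """Learn normal behavior as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, k=3.0):
    """Flag a value more than k standard deviations from the mean."""
    mu, sigma = baseline
    return abs(value - mu) > k * sigma


# Hypothetical response times (ms) for a service under normal load.
history = [100, 102, 98, 101, 99, 100]
baseline = learn_baseline(history)

normal = is_anomalous(103, baseline)   # within the learned band
warning = is_anomalous(110, baseline)  # outside the learned band
```

Because the band is derived from observed behavior rather than a fixed threshold, the same check can be applied to pool-level or KPI-level metrics as well as to device metrics.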

4. Make Holistic Operations Decisions

In the cloud, you have to manage more types of constructs in your environment than in the traditional IT environment. In addition to servers, operating systems, and applications, you will have compute pools, storage pools, network containers, services, and tenants (for service providers). These new constructs are tightly coupled. You cannot view their performance and capacity data in silos; they have to be managed holistically. It is important to know who your most crucial customers are and to identify their services, so that when a problem occurs you can focus on recovering them in order of priority. In addition, you may want to send alerts to affected customers to proactively let them know there is an issue. Your operations management solution should give you a panoramic view of all these aspects and their relationships. Not only will it let you quickly isolate the problem, but it will also save you money, because you will know which SLAs cost more to breach and therefore should be addressed first.
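Ordering recovery by the cost of an SLA breach can be sketched in a few lines of Python. The AffectedService type, its breach_cost_per_hour field, and the tenant names and figures below are all hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class AffectedService:
    name: str
    tenant: str
    breach_cost_per_hour: float  # assumed penalty if the SLA is breached


def recovery_order(services):
    """Sort affected services so the costliest SLA breach comes first."""
    return sorted(services, key=lambda s: s.breach_cost_per_hour, reverse=True)


affected = [
    AffectedService("reporting", "tenant-b", 200.0),
    AffectedService("checkout", "tenant-a", 5000.0),
    AffectedService("wiki", "tenant-c", 10.0),
]
ordered = [s.name for s in recovery_order(affected)]
tenants_to_notify = sorted({s.tenant for s in affected})
```

The same list of affected services also yields the set of tenants that should be notified proactively.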

5. Enable Self-Service for Operations

To give your cloud users their desired experience while also saving on support costs, it’s important to provide constant feedback. Traditionally, performance data has not been available to the end user. In the cloud, however, there are far more users and service requests per administrator. For that reason, it’s important to minimize “false alarms” and routine manual requests. The best way to do so is to let your end users see the performance and capacity data surrounding their services. You can also let your users define the key performance indicators (KPIs) to monitor, the threshold levels they want to set, and the routine remediation processes they want to trigger (such as auto-scaling). The operations management solution should allow you to easily plug this data into your end-user portal.
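User-defined KPIs, thresholds, and remediation triggers can be represented as simple rules that are evaluated against current metrics. The sketch below is a hypothetical data model, not a product API; the KpiRule type, the metric names, and the action names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class KpiRule:
    kpi: str          # metric the user wants to watch
    threshold: float  # level above which the rule fires
    action: str       # remediation to trigger, e.g. "scale_out"


def triggered_actions(rules, metrics):
    """Return the actions of every rule whose KPI exceeds its threshold.

    KPIs that are missing from the metrics are skipped.
    """
    return [
        rule.action
        for rule in rules
        if rule.kpi in metrics and metrics[rule.kpi] > rule.threshold
    ]


user_rules = [
    KpiRule("cpu_pct", 80.0, "scale_out"),
    KpiRule("latency_ms", 500.0, "notify_owner"),
]
actions = triggered_actions(user_rules, {"cpu_pct": 91.0, "latency_ms": 120.0})
```

Letting users own rules like these means routine scaling decisions no longer have to arrive as support requests.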


6. Make Cloud Services Resilient 

Resiliency is the ultimate goal of proper cloud operations management. If a solution is able to understand the behavior of cloud services and proactively pinpoint potential issues, the natural next step is for that solution to automatically isolate and eliminate problems. That requires three things. First, the solution must have accurate behavior learning and analytics capabilities. Second, someone must define clear policies, which are then enforced by an automated policy engine or a human-interactive process. Lastly, the solution must plug seamlessly into other lifecycle management solutions, such as provisioning, change management, and service request management. Operations management in a silo cannot make your cloud resilient. To ensure your success, you should plan the right architectural design as a foundation and implement a good management process that reflects the paradigm shift.
Thought leadership Whitepaper by Brian Singer, BMC Software