From the course: Cloud Observability and Operations: Considerations for Security, Governance, Monitoring, and Cost Control

Cloud monitoring, observability, and operations basics

Operations, simply put, is everything that happens within a company to keep it running smoothly and earning money. This includes providing consistent services that live up to the expectations of the business. CloudOps, or Cloud Operations, applies that concept to running cloud computing systems so that they provide consistent service to the business. CloudOps has grown rapidly in importance because cloud computing is increasingly a foundational technology infrastructure for enterprise IT. Events such as outages and data breaches have a far-reaching negative effect on the business.
Thus, there is a need for approaches and disciplines for how we do Cloud Operations, including how we gain insights into how cloud systems are running. This ranges from obvious things, such as a system not working, to less obvious things, such as behavior indicating that an outage is likely due to some failure of a compute or storage system. It can also include taking preventative action, such as automatically fixing the problem or providing alerts so that others can fix it.
There are two levels at which Cloud Operations can occur: monitoring and observability. Monitoring is looking at the data coming in from the devices under management. This could range from basic data monitoring, such as a green light that indicates that something is working, to more detailed data monitoring, such as the amount of storage being utilized and the volume of bytes flowing to and from some resource, such as a compute or storage system on a cloud provider. Monitoring could also include network utilization and CPU saturation: all of the types of things that we have considered necessary for any system's monitoring, in the past and currently.
Observability is related to monitoring, but it's a bit different. It's the ability to determine insights from monitoring data.
For example, monitoring data could provide the status of a bare metal server running on a public cloud provider, including working state, data processes, CPU utilization, et cetera. All that bare metal means is that we're not using the virtual machines, or VMs, that most public cloud providers offer. Instead, we're using a physical server, much like we would within a traditional data center. Generally, we implement this for security and performance reasons. If we apply observability to our bare metal, physical server scenario, we may be able to find new insights into the state of that server. These insights could include memory performance behavior data that may lead us to understand that there is a 70% chance that the specific server will have a memory parity issue, which could result in the server resetting without warning. That reset would impact every application running at the time, and likely result in the loss of some data.
While we can certainly monitor memory performance data on an ongoing basis, the idea is that most humans won't or can't do that. Thus, we need automated systems using deep analytics and AI to find insights that may be missed when just looking at raw monitoring data. The ability to look at monitoring data in more productive ways is why observability is rising in popularity when it comes to Cloud Operations. The information gathered from monitoring systems is helpful, but it's the additional observability that gives us the ability to provide deeper analysis into what monitoring data actually means. It enables us to apply those insights to make better operational decisions. In our previous example, that could mean moving application processing to another bare metal or virtual server automatically, in order to remove the risk that the particular server will reset or shut down.
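To make the difference between the two levels a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any particular cloud provider or monitoring product works: the `Sample` fields, the thresholds, the `corrected_mem_errors` metric, and the `migrate_workloads` action name are all hypothetical. A simple straight-line trend also stands in for the deep analytics and AI mentioned above. The monitoring function only compares the latest reading against fixed limits, while the observability-style function looks at the history to estimate when a limit is likely to be crossed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One monitoring reading for a server (all fields are hypothetical)."""
    minute: int                # time of the reading, in minutes
    cpu_percent: float         # CPU utilization, 0-100
    corrected_mem_errors: int  # corrected memory errors seen in this interval


def monitoring_alerts(sample: Sample, cpu_limit: float = 90.0,
                      mem_error_limit: int = 50) -> List[str]:
    """Monitoring: compare the latest reading against fixed thresholds."""
    alerts = []
    if sample.cpu_percent > cpu_limit:
        alerts.append("cpu_saturation")
    if sample.corrected_mem_errors > mem_error_limit:
        alerts.append("memory_errors")
    return alerts


def slope(xs: List[float], ys: List[float]) -> float:
    """Least-squares slope of ys against xs (0.0 if xs do not vary)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom


def minutes_until_limit(history: List[Sample],
                        mem_error_limit: int = 50) -> Optional[float]:
    """Observability-style insight: using the fitted error rate, project from
    the latest reading how long until the limit is crossed.
    Returns None when there is too little data or the trend is not rising."""
    if len(history) < 2:
        return None
    rate = slope([s.minute for s in history],
                 [s.corrected_mem_errors for s in history])
    if rate <= 0:
        return None
    remaining = mem_error_limit - history[-1].corrected_mem_errors
    return max(0.0, remaining / rate)


def recommend_action(history: List[Sample],
                     horizon_minutes: float = 120.0) -> str:
    """Turn the projection into an operational decision (hypothetical names)."""
    eta = minutes_until_limit(history)
    if eta is not None and eta <= horizon_minutes:
        return "migrate_workloads"
    return "no_action"


if __name__ == "__main__":
    history = [Sample(0, 40.0, 2), Sample(10, 42.0, 6),
               Sample(20, 41.0, 10), Sample(30, 43.0, 14)]
    print("monitoring alerts:", monitoring_alerts(history[-1]))
    print("recommended action:", recommend_action(history))
```

In this made-up history, every individual reading is still below its threshold, so plain monitoring stays quiet. The trend, however, is rising steadily, and that is the kind of signal an observability layer could act on before the server actually resets.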
If the differences seem a bit complex right now, don't worry. We'll drill down deeper into each topic with many examples, and it should be relatively clear to you by the end of the course. Also, keep in mind that we're learning about these topics so we can provide better Cloud Operations through a better understanding of all systems in the cloud.
