Setting Up Monitoring and Alerts in OCI for Your Resources

Oracle Cloud Infrastructure (OCI) provides powerful Monitoring and Alarms features to help you gain visibility into the health and performance of your cloud resources. In this blog post, we’ll walk through the step-by-step process of setting up monitoring and alerts for CPU, Memory, and other important metrics using OCI’s native tools.

✅ Whether you’re managing Compute Instances, Autonomous Databases, Load Balancers, or other services, OCI Monitoring helps you stay ahead of performance issues by automatically collecting and analyzing metrics.

🎯 What We Will Cover

Understanding OCI Monitoring and Metrics
Enabling Monitoring for Resources
Setting Up Alarms (CPU, Memory, etc.)
Viewing Metrics in the OCI Console
Notifications Using OCI Notifications Service
Bonus: Monitoring Using CLI & SDK (Optional for automation)

🧠 1. What is OCI Monitoring?

OCI Monitoring allows you to collect, view, and analyze metrics for your cloud resources in near real-time. It includes:

Metrics: Time-series data points related to a resource's performance (CPU, memory, IOPS, etc.)
Alarms: Rules that trigger when a metric crosses a threshold.
Namespaces: Logical groups of metrics. E.g., oci_computeagent, oci_autonomous_database, etc.

🛠️ 2. Enabling Monitoring for Your Resources

Most OCI services have monitoring enabled by default (e.g., Compute, Autonomous DB). For custom applications, you can publish custom metrics using the OCI SDK.
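As a rough illustration of what a custom metric looks like, the sketch below builds one entry in the shape used by the Monitoring PostMetricData request (namespace, compartmentId, name, dimensions, datapoints). The helper name `build_custom_metric`, the namespace `my_app`, the metric `QueueDepth`, and the dimension `appName` are made up for this example, and the reserved-prefix check reflects my understanding that custom namespaces may not start with `oci_` or `oracle_`. It only prepares the payload; sending it with the SDK is left out.

```python
from datetime import datetime, timezone

# Assumption: prefixes reserved for Oracle-provided namespaces.
RESERVED_PREFIXES = ("oci_", "oracle_")


def build_custom_metric(namespace, compartment_id, name, value, dimensions, when=None):
    """Return one metric entry shaped like a PostMetricData request item."""
    if namespace.startswith(RESERVED_PREFIXES):
        raise ValueError(f"namespace {namespace!r} uses a reserved prefix")
    # Default to "now"; always express the timestamp in UTC.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "namespace": namespace,
        "compartmentId": compartment_id,
        "name": name,
        "dimensions": dict(dimensions),
        "datapoints": [
            {"timestamp": when.strftime("%Y-%m-%dT%H:%M:%SZ"), "value": float(value)}
        ],
    }


# Hypothetical application metric: depth of an order-processing queue.
metric = build_custom_metric(
    namespace="my_app",
    compartment_id="ocid1.compartment.oc1..xxxxx",
    name="QueueDepth",
    value=42,
    dimensions={"appName": "orders"},
    when=datetime(2025, 5, 25, 12, 0, tzinfo=timezone.utc),
)
print(metric["datapoints"][0])
# {'timestamp': '2025-05-25T12:00:00Z', 'value': 42.0}
```

Once published, a custom metric like this can be charted and alarmed on in the same way as the service metrics covered in the rest of this post.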

✅ Compute Instance Monitoring

To monitor compute metrics like CPU, memory, and disk, make sure Oracle Cloud Agent is running on the instance with its Compute Instance Monitoring plugin enabled.

Steps:

Log in to the OCI Console.
Navigate to Compute > Instances.
Click on your instance.
In the Resources section, click Metrics.
If no metrics appear:

Make sure the Compute Instance Monitoring plugin is enabled for the instance (Oracle Cloud Agent tab).
SSH into the instance and check that the agent service is running:

sudo systemctl status oracle-cloud-agent

✅ Autonomous Database Monitoring

Metrics like CPU, Storage, Sessions, etc., are automatically collected.

Navigate to Autonomous Database > [Your DB] > Metrics.
Choose from metrics like CpuUtilization, StorageUtilization, etc.

🔔 3. Setting Up Alarms in OCI

Let’s walk through creating an Alarm for CPU Utilization on a compute instance.

📍 Step-by-Step: Alarm for CPU Utilization

Go to Observability & Management > Monitoring > Alarm Definitions.
Click Create Alarm.
Basic Info:
- Name: High-CPU-Alarm
- Compartment: Choose your compartment
- Metric namespace: oci_computeagent
Metric Details:
- Metric name: CpuUtilization
- Resource group: (Optional)
- Dimensions:
  - resourceId: Select your compute instance OCID.
- Statistic: mean
- Interval: 1 minute
Alarm Trigger Rule:
- Trigger when: CpuUtilization is greater than 80 (percent)
- Trigger delay minutes: 5 (the condition must hold for this whole period before the alarm fires)
Notification:
- Select a Notification Topic (create one if needed).
- Example: HighCPUAlertTopic
Message Format: formatted messages, Pretty JSON, or raw messages
Click Create Alarm.
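To make the trigger rule concrete, here is a small sketch of how a sustained-breach rule behaves: the alarm fires only once the threshold has been exceeded for a set number of consecutive one-minute datapoints. This is a simplified model for building intuition, not OCI's actual evaluation engine, and the function name and sample CPU values are made up.

```python
def first_firing_minute(datapoints, threshold, delay_minutes):
    """Return the index of the minute at which the alarm would fire, or None.

    Simplified model: one datapoint per minute, and the alarm fires as soon as
    `delay_minutes` consecutive datapoints are above `threshold`.
    """
    streak = 0
    for minute, value in enumerate(datapoints):
        # Any datapoint at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= delay_minutes:
            return minute
    return None


# Hypothetical CPU readings, one per minute.
cpu = [72, 85, 91, 78, 88, 93, 95, 97, 90, 60]

print(first_firing_minute(cpu, threshold=80, delay_minutes=5))  # 8
print(first_firing_minute(cpu, threshold=80, delay_minutes=3))  # 6
```

The short dip at minute 3 resets the count, which is why a longer delay filters out brief spikes at the cost of a slower alert.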

Repeat similar steps for:

MemoryUtilization (if agent is running)
DiskIopsRead, DiskIopsWritten (or DiskBytesRead, DiskBytesWritten for throughput)
NetworksBytesIn, NetworksBytesOut
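Because the same alarm pattern repeats for each metric, the query text can be generated rather than typed by hand. The sketch below assembles Monitoring Query Language (MQL) strings of the form `Metric[interval]{dimension = "value"}.statistic() > threshold`. The helper `alarm_query` and the 85% memory threshold are hypothetical choices for illustration.

```python
def alarm_query(metric, threshold, interval="1m", statistic="mean",
                resource_id=None, operator=">"):
    """Build an MQL alarm query string for a single metric."""
    # Optional dimension filter that limits the query to one resource.
    selector = f'{{resourceId = "{resource_id}"}}' if resource_id else ""
    return f"{metric}[{interval}]{selector}.{statistic}() {operator} {threshold}"


# Hypothetical thresholds per metric; tune them for your workload.
thresholds = {"CpuUtilization": 80, "MemoryUtilization": 85}

for name, limit in thresholds.items():
    print(alarm_query(name, limit))
# CpuUtilization[1m].mean() > 80
# MemoryUtilization[1m].mean() > 85

print(alarm_query("CpuUtilization", 80, resource_id="ocid1.instance.oc1.iad.xxxxx"))
# CpuUtilization[1m]{resourceId = "ocid1.instance.oc1.iad.xxxxx"}.mean() > 80
```

The generated strings can be pasted into the alarm's advanced query editor or reused in CLI and Terraform definitions.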

📈 4. Viewing Metrics in the OCI Console

Navigate to Observability & Management > Monitoring > Metrics Explorer.
Choose the:
- Compartment
- Namespace (e.g., oci_computeagent)
- Metric name (e.g., CpuUtilization)
Set dimensions (like instance ID).
Switch between the chart and the data table view as needed.
Use filters and time range for specific insights.

📣 5. Setting Up Notifications

📍 Steps to Create a Notification Topic

Go to Developer Services > Application Integration > Notifications, then open Topics.
Click Create Topic.
- Name: HighCPUAlertTopic
- Compartment: Choose the same as your resource
After creating the topic, click Create Subscription.
- Protocol: Email
- Endpoint: Your email address
Confirm the email subscription (check your inbox).

Your alarms will now notify you whenever thresholds are crossed!

🧪 Bonus: Monitoring with CLI (Optional for Automation)

You can also create alarms and monitor metrics using the OCI CLI for automation purposes.

Sample CLI to Get CPU Metrics:

# The instance filter goes inside the query text; summarize-metrics-data has no --resource-id option.
oci monitoring metric-data summarize-metrics-data \
  --compartment-id ocid1.compartment.oc1..xxxxx \
  --namespace oci_computeagent \
  --query-text 'CpuUtilization[1m]{resourceId = "ocid1.instance.oc1.iad.xxxxx"}.mean()' \
  --start-time 2025-05-25T00:00:00Z \
  --end-time 2025-05-25T23:59:59Z
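For automation you usually want a rolling time window instead of hard-coded dates. The sketch below builds the argument list for the same CLI call with a window that ends at a given time. It uses only the `--compartment-id`, `--namespace`, `--query-text`, `--start-time`, and `--end-time` options and filters the instance inside the query text. The helper name `summarize_command` is hypothetical, and the OCIDs are placeholders.

```python
import shlex
from datetime import datetime, timedelta, timezone


def summarize_command(compartment_id, namespace, query, end, hours=24):
    """Return the argv list for summarize-metrics-data over the last `hours`."""
    start = end - timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # timestamps are expected in UTC
    return [
        "oci", "monitoring", "metric-data", "summarize-metrics-data",
        "--compartment-id", compartment_id,
        "--namespace", namespace,
        "--query-text", query,
        "--start-time", start.strftime(fmt),
        "--end-time", end.strftime(fmt),
    ]


cmd = summarize_command(
    compartment_id="ocid1.compartment.oc1..xxxxx",
    namespace="oci_computeagent",
    query='CpuUtilization[1m]{resourceId = "ocid1.instance.oc1.iad.xxxxx"}.mean()',
    end=datetime(2025, 5, 26, 0, 0, tzinfo=timezone.utc),
)
# Print a shell-safe version of the command for review.
print(shlex.join(cmd))
```

In a real script you would pass `datetime.now(timezone.utc)` as `end` and hand the list to `subprocess.run(cmd, check=True)`. Passing a list avoids shell quoting problems with the braces and quotes in the query.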

🔍 Understanding Metric Namespaces and Dimensions

OCI organizes monitoring data into namespaces and dimensions:

Namespace: A logical group of metrics for a service.
- Examples:
  - oci_computeagent – Compute instance agent metrics
  - oci_autonomous_database – Autonomous DB metrics
  - oci_blockstore – Block volume metrics
Dimensions: Key-value pairs that help filter metric data.
- Example: resourceId, availabilityDomain, resourceDisplayName, etc.

💡 Tip: Use dimensions effectively to narrow down metrics for a specific resource or group.
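To show what filtering by dimension means in practice, the sketch below treats each metric stream as a dictionary with a `dimensions` map and keeps only the streams whose dimensions match every requested key-value pair. The data structure, the helper `filter_streams`, and the sample values are invented for illustration and are not the exact shape returned by OCI.

```python
def filter_streams(streams, **wanted):
    """Keep streams whose dimensions match every key=value pair in `wanted`."""
    return [
        s for s in streams
        if all(s["dimensions"].get(key) == value for key, value in wanted.items())
    ]


# Hypothetical metric streams for two instances in different availability domains.
streams = [
    {"name": "CpuUtilization",
     "dimensions": {"resourceId": "ocid1.instance.oc1.iad.aaaa", "availabilityDomain": "AD-1"}},
    {"name": "CpuUtilization",
     "dimensions": {"resourceId": "ocid1.instance.oc1.iad.bbbb", "availabilityDomain": "AD-2"}},
]

matches = filter_streams(streams, availabilityDomain="AD-2")
print([s["dimensions"]["resourceId"] for s in matches])
# ['ocid1.instance.oc1.iad.bbbb']
```

Adding more key-value pairs narrows the result further, which mirrors how stacking dimensions in a query narrows the metric streams it returns.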

📊 Common Metrics to Monitor by Resource Type

🖥️ Compute Instances

Metric Name	Description
`CpuUtilization`	% CPU in use
`MemoryUtilization`	% RAM used (requires agent)
`DiskIopsRead` / `DiskIopsWritten`	Disk read/write I/O operations per second
`NetworksBytesIn` / `NetworksBytesOut`	Incoming/outgoing network traffic in bytes

🗄️ Block Volumes

Metric	Description
`VolumeReadOps`	Read operations during the interval
`VolumeWriteOps`	Write operations during the interval

🧠 Autonomous Databases

Metric Name	Description
`CpuUtilization`	% CPU used
`StorageUtilization`	% storage used
`Sessions`	Number of sessions in the database

🛡️ Best Practices for OCI Monitoring & Alerts

Base Thresholds on History: Instead of guessing a static threshold, review historical trends in Metrics Explorer and set the alert level from what is normal for that resource.
Group Alerts by Environment: Separate alerts for dev/test/prod to avoid false alarms.
Integrate with Incident Management Tools: Notifications subscriptions support PagerDuty and Slack directly; tools such as Opsgenie can be reached through an HTTPS (custom URL) subscription.
Audit Your Alarms: Periodically review active alarms and remove unnecessary ones.
Tag Resources: Use tagging to easily filter and manage metrics.
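One way to turn historical trends into a threshold, as the first practice above suggests, is to take a high percentile of past readings and add some headroom. The sketch below does that with Python's standard library. The function name, the 95th percentile, the 5-point headroom, and the 95% ceiling are arbitrary starting points rather than OCI recommendations.

```python
import statistics


def suggest_threshold(history, percentile=95, headroom=5.0, ceiling=95.0):
    """Suggest an alarm threshold from past utilization readings (in percent)."""
    # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cut_points = statistics.quantiles(history, n=100, method="inclusive")
    baseline = cut_points[percentile - 1]
    # Add headroom above normal behaviour, but never exceed the ceiling.
    return min(round(baseline + headroom, 1), ceiling)


# Hypothetical week of hourly CPU readings for a steady workload.
steady = [40.0] * 168
print(suggest_threshold(steady))  # 45.0
```

The history would normally come from exported metric data, and the suggested value is a starting point to review, not something to apply blindly.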

🔁 Automatically Respond to Alerts

OCI Alarms can do more than notify people: an alarm publishes to a Notifications topic, and a Function subscription on that topic can start an automation workflow.

Example use cases:

Scale up compute instances when CPU > 80%
Restart a DB if storage utilization exceeds a threshold
Send logs to an external system (like Splunk)
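The core of such a workflow is a small piece of logic that reads the alarm message and decides what to do. The sketch below shows only that decision step in plain Python. The names `plan_action` and `ACTIONS` are hypothetical, and the message fields (`type`, `title`, `severity`) and the `OK_TO_FIRING` / `FIRING_TO_OK` values are my assumption about the alarm message JSON, so check them against a real message from your topic before relying on them.

```python
import json

# Assumed alarm state transitions mapped to hypothetical actions.
ACTIONS = {"OK_TO_FIRING": "remediate", "FIRING_TO_OK": "stand_down"}


def plan_action(raw_message):
    """Decide what to do for one alarm message delivered as a JSON string."""
    msg = json.loads(raw_message)
    return {
        # Unknown or missing transition types are ignored rather than acted on.
        "action": ACTIONS.get(msg.get("type"), "ignore"),
        "alarm": msg.get("title", "unknown"),
        "severity": msg.get("severity", "unknown"),
    }


# Hypothetical message resembling a firing CPU alarm.
sample = json.dumps(
    {"title": "High-CPU-Alarm", "type": "OK_TO_FIRING", "severity": "CRITICAL"}
)
print(plan_action(sample))
# {'action': 'remediate', 'alarm': 'High-CPU-Alarm', 'severity': 'CRITICAL'}
```

Keeping the decision separate from the code that actually scales or restarts something makes it easy to test without touching real resources.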

⚙️ Automation with Terraform (Optional Section)

If you're managing OCI infrastructure using Terraform, you can create alarms as Infrastructure-as-Code.

Example Terraform Snippet for CPU Alarm

resource "oci_monitoring_alarm" "cpu_alarm" {
  compartment_id        = var.compartment_ocid # compartment that holds the alarm
  metric_compartment_id = var.compartment_ocid # compartment that holds the metric
  display_name          = "High CPU Alarm"
  namespace             = "oci_computeagent"

  # The threshold is part of the MQL query; there is no separate metric_query argument.
  query    = "CpuUtilization[1m].mean() > 80"
  severity = "CRITICAL"
  body     = "CPU usage is above threshold"

  is_enabled                   = true
  message_format               = "ONS_OPTIMIZED" # valid values: RAW, PRETTY_JSON, ONS_OPTIMIZED
  repeat_notification_duration = "PT10M"         # re-notify every 10 minutes while firing
  destinations                 = [oci_ons_notification_topic.cpu_topic.id]
}

🔁 Integration with Logging and Events

Combine Monitoring, Logging, and Events for complete observability.

Use Logging Search to investigate issues flagged by Alarms.
Use Event Rules to trigger remediation or audits.

🧵 Wrapping Up

With the power of OCI Monitoring and Alarms, you can:

✅ Stay ahead of system issues
✅ Get real-time notifications
✅ Automate incident response
✅ Maintain high availability and performance

By leveraging the steps above, you’ll have a robust observability setup for your OCI workloads. Don't forget to regularly audit your alarms and notification settings to align with evolving performance requirements.

Vidhyadharan