When an Azure Virtual Machine (VM) stops unexpectedly, root cause analysis (RCA) means checking several sources in turn to find out what stopped it and why. This step-by-step guide walks through them.
Step 1: Check the Azure Activity Log
The Azure Activity Log records all control-plane activities, including operations that start, stop, or restart VMs. This is the first place to look for clues about why the VM stopped.
Example: Accessing the Activity Log
- Go to the Azure Portal.
- Navigate to the VM in question.
- In the left-hand menu, select Activity Log.
- Filter the log by the time when the VM stopped.
Look for any operations related to Stop Virtual Machine, Deallocate Virtual Machine, Restart Virtual Machine, or Delete Virtual Machine.
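Example: Querying the Activity Log from the Azure CLI
The same check can be sketched on the command line. This is a minimal sketch rather than a definitive query: it assumes the Azure CLI is installed and logged in, the helper name vm_stop_events is hypothetical, and the substrings used to match operation names (deallocate, powerOff, restart, virtualMachines/delete) are assumptions that should be compared against what the log actually shows.

```shell
# Hypothetical helper: list stop-type operations recorded for one VM.
# $1 = VM resource ID, $2 = start time, $3 = end time (ISO 8601, UTC)
vm_stop_events() {
  az monitor activity-log list \
    --resource-id "$1" \
    --start-time "$2" \
    --end-time "$3" \
    --query "[?contains(operationName.value, 'deallocate') || contains(operationName.value, 'powerOff') || contains(operationName.value, 'restart') || contains(operationName.value, 'virtualMachines/delete')].{time: eventTimestamp, operation: operationName.value, status: status.value, caller: caller}" \
    --output table
}
```

The resource ID can be read with az vm show --resource-group <group> --name <vm> --query id --output tsv. The caller column is usually the most useful one: it shows whether a user, an automation identity, or the platform initiated the operation.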
Step 2: Examine the VM’s Boot Diagnostics
If the VM was stopped due to a crash or a system failure, the Boot Diagnostics logs might provide insights into what happened.
Example: Viewing Boot Diagnostics Logs
- Go to the Azure Portal.
- Navigate to the VM and select Boot diagnostics under Support + troubleshooting.
- Download the boot logs or view the screenshot to check for errors during startup or shutdown.
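Example: Scanning the Serial Log from the Azure CLI
When the portal view is too long to read by eye, the serial log can be pulled and filtered. This is a rough sketch that assumes boot diagnostics is already enabled on the VM; the helper name vm_boot_errors and the keyword list are illustrative assumptions, not an exhaustive set of failure signatures.

```shell
# Hypothetical helper: print serial-log lines that mention common failure keywords,
# each prefixed with its line number in the log.
# $1 = resource group, $2 = VM name
vm_boot_errors() {
  az vm boot-diagnostics get-boot-log --resource-group "$1" --name "$2" \
    | grep -i -n -E 'panic|error|fail|out of memory'
}
```

The line numbers make it easier to go back to the full log and read the lines around each match.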
Step 3: Check VM Status in Azure Resource Health
Azure Resource Health provides information on the current and past health of your VM. It can indicate whether the stop was due to an Azure service issue or planned maintenance.
Example: Checking Resource Health
- Go to the Azure Portal.
- Navigate to the VM and select Resource Health under Support + troubleshooting.
- Review the health events. Look for any warnings or errors that correspond to the time when the VM stopped.
Step 4: Review the System and Application Logs
Inside the VM, review the Windows Event Logs (for Windows VMs) or syslog (for Linux VMs) to find any errors or warnings that might explain the shutdown.
Example: Accessing Logs for Windows VM
- Connect to the VM via Remote Desktop (RDP).
- Open the Event Viewer.
- Navigate to Windows Logs > System.
- Look for Critical, Error, or Warning events around the time the VM stopped.
Example: Accessing Logs for Linux VM
- Connect to the VM via SSH.
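- View the syslog (/var/log/syslog on Debian and Ubuntu; Red Hat-based distributions typically use /var/log/messages):
sudo grep -i -A 10 -B 10 "shutdown" /var/log/syslog
or, on systemd-based distributions, jump to the end of the journal:
sudo journalctl -xe
- Review the logs for any relevant errors.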
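Example: Checking How the Previous Boot Ended
After a restart, the newest log entries belong to the current boot, so it often helps to look at the previous one directly. The sketch below assumes a systemd-based distribution with a persistent journal and a system that still ships the last utility; both helper names are hypothetical.

```shell
# Hypothetical helper: shutdown and reboot records from wtmp, newest first.
shutdown_history() {
  last -x shutdown reboot
}

# Hypothetical helper: last N lines (default 200) of the previous boot's journal.
# Needs a persistent journal; otherwise only the current boot is available.
previous_boot_tail() {
  sudo journalctl -b -1 -n "${1:-200}" --no-pager
}
```

A reboot record with no shutdown record before it often suggests the guest did not shut down cleanly, which points toward a crash or a host-side event rather than a command run inside the VM.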
Step 5: Analyze Azure Monitor Metrics
Azure Monitor can provide insights into the VM’s performance leading up to the stop event. Metrics like CPU, memory, and disk usage might indicate resource exhaustion or other issues.
Example: Accessing Azure Monitor Metrics
- Go to the Azure Portal.
- Navigate to Monitor > Metrics.
- Select the VM in question.
- Add metrics such as Percentage CPU, Available Memory Bytes, and the disk IOPS metrics.
- Set the time range to cover the period before the VM stopped.
- Analyze the metrics for any anomalies or spikes.
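Example: Pulling CPU Metrics from the Azure CLI
The same data can be retrieved as a table for the window before the stop. This is a minimal sketch assuming the Azure CLI is logged in; the helper name vm_cpu_before_stop and the 5-minute interval are illustrative choices, and the metric name can be swapped for any other metric the VM exposes.

```shell
# Hypothetical helper: average and peak CPU in 5-minute buckets for one VM.
# $1 = VM resource ID, $2 = start time, $3 = end time (ISO 8601, UTC)
vm_cpu_before_stop() {
  az monitor metrics list \
    --resource "$1" \
    --metric "Percentage CPU" \
    --start-time "$2" \
    --end-time "$3" \
    --interval PT5M \
    --aggregation Average Maximum \
    --output table
}
```

Sustained values near 100% right before the stop are worth correlating with the guest logs, while a flat series that simply ends points more toward an external stop than resource exhaustion.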
Step 6: Check for Auto-Shutdown or Policy Triggers
If the VM was configured for auto-shutdown or was affected by a policy (e.g., cost-saving measures), this could be the reason for the stop.
Example: Verify Auto-Shutdown Settings
- Go to the Azure Portal.
- Navigate to the VM.
- Under Operations, select Auto-shutdown.
- Check if Auto-shutdown is enabled and if the VM stopped at the scheduled time.
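Example: Reading the Auto-Shutdown Schedule from the Azure CLI
The portal setting is stored as a separate schedule resource, so it can also be inspected from the command line. This sketch rests on two assumptions that should be verified in your subscription: that the schedule lives in the VM's resource group, and that it is named shutdown-computevm-<vm name>, which is the name the portal normally gives it. The helper name is hypothetical.

```shell
# Hypothetical helper: show the auto-shutdown schedule attached to a VM.
# Assumes the schedule is named shutdown-computevm-<vm name>.
# $1 = resource group, $2 = VM name
vm_autoshutdown_schedule() {
  az resource show \
    --resource-group "$1" \
    --resource-type "Microsoft.DevTestLab/schedules" \
    --name "shutdown-computevm-$2" \
    --query "{status: properties.status, time: properties.dailyRecurrence.time, timeZone: properties.timeZoneId}" \
    --output json
}
```

If the command reports that the resource was not found, no auto-shutdown schedule with that name exists for the VM. Otherwise, compare the reported time and time zone with the stop time seen in the Activity Log.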
Step 7: Review Azure Service Health
If none of the above steps reveal the cause, it could be an Azure platform issue. Review Azure Service Health for any incidents or outages that might have affected your VM.
Example: Check Azure Service Health
- Go to the Azure Portal.
- Navigate to Service Health.
- Review any Active Events or Past Events for your region.
Step 8: Contact Azure Support (if necessary)
If you’re unable to determine the root cause, consider opening a support ticket with Azure. Provide them with details from the steps above to help them assist you more efficiently.
Example: Open a Support Ticket
- Go to the Azure Portal.
- Select Help + support from the left-hand menu.
- Click Create a support request and provide relevant details.
Summary
- Check the Activity Log for any operations that stopped the VM.
- Review Boot Diagnostics for any errors during the VM’s startup or shutdown.
- Check Resource Health to see if Azure platform issues were involved.
- Examine System and Application Logs inside the VM.
- Analyze Azure Monitor Metrics to identify resource issues.
- Verify Auto-Shutdown or Policy Settings that could trigger a stop.
- Review Azure Service Health for incidents.
- Contact Azure Support if needed.