Imagine you wake up to a message:
“Why is prod down… and why did our cloud bill spike overnight?”
You open dashboards. CPU looks normal now. No obvious deploy. No incidents logged.
Then you notice something: a new admin user was created at 2:13 AM, in a region your company doesn’t even use. A firewall rule was opened. A storage bucket’s permissions changed. Logs were disabled… and re-enabled.
How do you prove what happened—quickly, confidently, and in a way that stands up in a security review?
That’s what cloud audit logging is for.
This guide is the “do it right once” blueprint:
- What to log (so you don’t miss the critical events)
- How long to keep logs (without drowning in cost)
- What alerts actually work (high-signal, low-noise)
No fluff. Just a practical system you can implement.

1) What “audit logs” really are (and what they are not)
Audit logs answer: “Who did what, where, and when?”
They capture control plane actions like:
- Creating/deleting resources
- Changing IAM permissions
- Modifying firewall rules
- Disabling encryption
- Turning off logging (yes, this happens)
Audit logs are not:
- Application logs (your app’s errors, traces)
- Network flow logs (who talked to whom over the network)
- OS logs (syslog, auth.log)
- Data content logs (actual database rows or file contents)
A good cloud logging strategy uses all of them, but audit logs are the non-negotiable foundation—because they explain changes to the environment itself.
2) The 3 outcomes your audit logging must deliver
If your audit logging does not achieve these, it’s just storage.
Outcome A — Fast incident investigation
When something goes wrong, you can answer within minutes:
- What changed?
- Who changed it?
- Was it accidental, automated, or malicious?
Outcome B — Continuous detection (not monthly surprises)
You don’t wait for a breach report. You get alerted when risky actions occur.
Outcome C — Compliance and forensics-ready evidence
If you ever need to prove “this is what happened,” your logs are:
- Complete enough
- Protected against tampering
- Retained long enough
3) Step-by-step setup: a “minimum viable audit logging” blueprint
Step 1 — Define your “crown jewels”
Write down what you must protect first:
- Production accounts/subscriptions/projects
- Identity system (IAM, SSO, roles)
- Secrets (KMS/HSM, secret manager)
- Databases and storage
- Network perimeter (firewalls, gateways)
- CI/CD pipelines
You’ll use this list to decide which logs get extra retention and extra alerts.
Step 2 — Inventory your audit log sources
Most teams enable “some logging,” but miss important log sources.
At minimum, you want audit logs from:
- Identity & access management (users, roles, policies, keys, MFA)
- Resource management (create/update/delete of compute, DB, storage)
- Network & perimeter (firewall rules, load balancers, routes, VPN, peering)
- Key management & secrets (KMS keys, key policies, secret access changes)
- Logging and security tools themselves (SIEM, log sinks, detectors)
Then decide if you also need data access audit logs for sensitive services:
- Object storage reads (who accessed what)
- Database audit logs (connections, schema changes)
- Secrets access events
Step 3 — Centralize logs (the “log archive” pattern)
This is where most setups fail.
If logs stay inside the same account where workloads run, an attacker who gains admin can try to:
- Disable logging
- Delete logs
- Change retention
- Alter sinks/exporters
Best practice: ship audit logs to a separate log archive environment with tighter controls.
Minimum safeguards:
- Workload accounts can write to archive, but cannot delete.
- Only a small security/admin group can manage retention and deletion.
- Logs are encrypted, and access is heavily monitored.
Think of it as a bank vault. You don’t store security camera footage in the cashier’s drawer.
Step 4 — Protect log integrity (tamper resistance)
Your audit logs should be:
- Write-once-ish (or at least delete-protected)
- Encrypted at rest
- Access-controlled
- Monitored for changes
Add alerts for:
- Retention reduced
- Log export disabled
- Log storage deleted
- Permissions changed on log archive
Because the first thing a skilled attacker does is try to erase the evidence.
Step 5 — Normalize the key fields (so you can actually alert)
Different clouds name fields differently, but your detection logic needs consistent concepts.
Normalize into these fields in your SIEM/log platform:
- timestamp
- actor (who did it)
- actor_type (user, role, service account, workload identity)
- action (what they did)
- target (resource affected)
- result (success/fail)
- source_ip
- user_agent / caller
- region / location
- auth_method (MFA? key? federated?)
- request_id (for correlation)
Once you have these, you can build alert rules that survive provider differences.
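As a concrete example, here is a minimal sketch of normalizing a raw AWS CloudTrail-style record into the common schema above. The input field names follow CloudTrail's JSON layout; mappings for other providers will differ, and you should verify each path against your own events.

```python
def normalize_cloudtrail(raw: dict) -> dict:
    """Map a CloudTrail-style event onto the provider-neutral schema."""
    identity = raw.get("userIdentity", {})
    return {
        "timestamp": raw.get("eventTime"),
        "actor": identity.get("arn"),
        "actor_type": identity.get("type"),          # e.g. IAMUser, AssumedRole
        "action": raw.get("eventName"),
        "target": raw.get("requestParameters", {}),  # resource details vary per API
        "result": "fail" if raw.get("errorCode") else "success",
        "source_ip": raw.get("sourceIPAddress"),
        "user_agent": raw.get("userAgent"),
        "region": raw.get("awsRegion"),
        "auth_method": identity.get("sessionContext", {})
                               .get("attributes", {})
                               .get("mfaAuthenticated"),
        "request_id": raw.get("requestID"),
    }
```

A matching `normalize_azure_activity` would map the same target fields from Azure's Activity Log schema, so downstream alert rules only ever see one shape.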
4) What to log: the practical “Tier 0 / Tier 1 / Tier 2” model
You don’t want to log “everything” blindly. You want everything that matters, and you want to treat critical events differently.
Tier 0 (must log, must alert, must retain longer)
These events are high impact and often high risk:
Identity & privilege
- New user / service account created
- Role assigned / elevated permissions
- Policy changes (especially admin permissions)
- Access keys created / rotated / disabled
- MFA disabled or bypassed
- SSO / federation settings changed
- “Break-glass” account used
Security posture
- Logging disabled, modified, or retention reduced
- Encryption disabled, key policy changed
- Public access enabled on storage
- Firewall/security rules opened broadly (0.0.0.0/0 or equivalent)
- Threat detection tool disabled
Network perimeter
- New ingress rules, VPN/peering changes
- Route table changes
- Load balancer listener changes (new ports exposed)
Data access (for crown jewels)
- Reads/downloads from sensitive storage buckets
- Database audit events (schema changes, new admin users)
- Secrets access policy changes
Tier 1 (log + alert in production; retain medium)
These drive most “how did this happen?” investigations:
- Compute instance/container created, deleted, resized
- Autoscaling configuration changed
- Database instance class/storage modified
- Backup policies changed
- Infrastructure-as-code pipeline changes (who approved/applied)
- Container registry settings changed
- Key service configuration changes (queues, topics, gateways)
Tier 2 (log for investigation; alert only if correlated)
These create noise if alerted on alone, but help in correlation:
- Read-only inventory events
- Frequent benign API calls from automation
- Routine deployment activities (if already covered by CI/CD logs)
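The three-tier model above can be sketched as a simple classifier. The action names below are a small illustrative sample (AWS-style API names), not a complete ruleset; in practice you would maintain these lists per provider.

```python
# Illustrative sample of Tier 0 / Tier 1 action names; extend per provider.
TIER0_ACTIONS = {
    "CreateUser", "AttachUserPolicy", "DeactivateMFADevice",
    "StopLogging", "PutBucketPolicy", "AuthorizeSecurityGroupIngress",
}
TIER1_ACTIONS = {
    "RunInstances", "TerminateInstances", "ModifyDBInstance",
    "UpdateAutoScalingGroup",
}

def classify(action: str) -> int:
    """Return the alerting tier (0, 1, or 2) for an audit action name."""
    if action in TIER0_ACTIONS:
        return 0
    if action in TIER1_ACTIONS:
        return 1
    return 2  # log for investigation; alert only when correlated
```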
5) Retention: how long should you keep audit logs?
Here’s the honest truth:
Retention isn’t a number. It’s a strategy.
You keep different levels of detail for different time windows.
The 3-tier retention strategy (simple and effective)
Hot (fast search, high cost): 30–90 days
- Full fidelity audit logs
- Indexed for quick investigation
- Used for day-to-day detection and incident response
Warm (searchable but cheaper): 3–12 months
- Still searchable, maybe slower
- Used for trend analysis, investigations that start late, internal audits
Cold (cheap archive): 1–7 years (or per compliance)
- Stored for compliance, legal holds, long investigations
- Not always instantly searchable, but retrievable
How to decide your numbers (beginner-friendly rule)
Ask these questions:
- How long do incidents typically go unnoticed? If the answer is “weeks,” your hot retention must cover it.
- What are your compliance requirements? Some industries require longer retention.
- How often do you need to investigate old events? If you routinely do postmortems or customer audits, keep warm longer.
A practical default (works for many teams)
- Hot: 60–90 days
- Warm: 12 months
- Cold: 3–7 years for Tier 0 events (or required systems)
And remember: you can keep Tier 0 longer than everything else.
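The defaults above can be expressed as one small lookup function. The day counts and the per-tier cold retention are assumptions taken from this section's suggested ranges; adjust them to your compliance needs.

```python
# Assumed defaults: hot 90 days, warm 12 months, cold 7 years for Tier 0.
HOT_DAYS = 90
WARM_DAYS = 365
COLD_DAYS = {0: 7 * 365, 1: WARM_DAYS, 2: WARM_DAYS}  # Tier 0 kept longest

def storage_class(age_days: int, tier: int) -> str:
    """Where a log record of a given age and tier lives, or 'expired'."""
    if age_days <= HOT_DAYS:
        return "hot"
    if age_days <= WARM_DAYS:
        return "warm"
    if age_days <= COLD_DAYS.get(tier, WARM_DAYS):
        return "cold"
    return "expired"
```

Note how a Tier 2 event expires after the warm window while a Tier 0 event of the same age stays in cold archive.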
6) Alerting use cases: high-signal alerts engineers won’t ignore
Most teams fail at alerting because:
- They alert on noisy events (“someone listed buckets”)
- They don’t include context (who, where, what changed)
- They don’t route alerts to owners with clear actions
A good alert rule is like a good bug report:
- What happened
- Why it matters
- Who did it
- What changed
- What to do next
Below is an “alert pack” you can implement.
Alert Pack A — Identity and privilege changes (Tier 0)
1) New admin granted
Trigger when:
- Any principal gets a high-privilege role or policy
Include in alert:
- Actor, target identity, new permission set, source IP, region
2) Access key created
Trigger when:
- New access key or credential generated
Extra rule:
- Higher severity if created outside business hours or from unusual IP
3) MFA disabled
Trigger when:
- MFA removed, disabled, or device changed
Extra:
- If it happens on a privileged account → critical
4) SSO / federation configuration changed
Trigger when:
- Identity provider settings changed
Why it matters:
- Attackers love rerouting auth.
Alert Pack B — “Covering tracks” events (Tier 0)
5) Audit logging disabled or modified
Trigger when:
- Logging stopped
- Log export/sink removed
- Retention reduced
- Log bucket permissions changed
This is usually critical. If someone can do this, they can hide everything else.
6) Log archive deletion attempts
Trigger when:
- Delete operations attempted on log storage, even if denied
Alert Pack C — Network and perimeter exposures (Tier 0)
7) Firewall opened to the world
Trigger when:
- Inbound rule allows wide-open ranges on sensitive ports
Severity guide:
- Critical for SSH/RDP/admin ports
- High for databases
- Medium for app ports depending on architecture
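The severity guide above is easy to encode as a lookup. The port groupings here are assumptions meant as a starting point; tune them to your own architecture.

```python
# Assumed port groupings; adjust to your environment.
ADMIN_PORTS = {22, 3389, 5985, 5986}        # SSH, RDP, WinRM
DB_PORTS = {3306, 5432, 1433, 27017, 6379}  # MySQL, Postgres, MSSQL, Mongo, Redis

def open_to_world_severity(port: int) -> str:
    """Severity when an inbound rule opens this port to 0.0.0.0/0."""
    if port in ADMIN_PORTS:
        return "critical"
    if port in DB_PORTS:
        return "high"
    return "medium"  # app ports: depends on architecture
```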
8) New public load balancer/listener created
Trigger when:
- Public-facing endpoints added
Include:
- Port, protocol, target resource, change actor
9) Route table or gateway changes
Trigger when:
- Routes modified in a way that changes egress path (exfiltration risk)
Alert Pack D — Data protection and encryption (Tier 0)
10) Encryption disabled / key policy changed
Trigger when:
- Encryption settings reduced
- Key access expanded
- Key rotation disabled
Why it matters:
- This impacts blast radius and compliance.
11) Storage bucket made public
Trigger when:
- Public ACL/policy allowed
Include:
- Bucket name, policy diff summary, actor
Alert Pack E — Suspicious behavior patterns (high value)
These require baselines, but pay off big.
12) Unusual region usage
Trigger when:
- Sensitive actions occur in regions you don’t operate in
13) Spike in failed auth
Trigger when:
- Many failures from one IP or many IPs for same identity
14) Privileged action from new IP / new user agent
Trigger when:
- Admin role used from never-seen network or device signature
15) Mass deletion or destructive burst
Trigger when:
- Delete operations exceed a threshold in a short time window
Examples:
- Many storage objects deleted
- Many instances terminated
- Many IAM policies removed
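A destructive burst is naturally detected with a sliding window. Below is a minimal sketch; the threshold and window size are illustrative defaults, and timestamps are epoch seconds.

```python
from collections import deque

class BurstDetector:
    """Trip when `threshold` delete events land within `window_seconds`."""

    def __init__(self, threshold: int = 20, window_seconds: int = 300):
        self.threshold = threshold
        self.window = window_seconds
        self.events: deque = deque()

    def record(self, ts: float) -> bool:
        """Record one delete event; return True when the burst trips."""
        self.events.append(ts)
        # Drop events that fell out of the window.
        while self.events and ts - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold
```

Run one detector per (actor, action-class) pair so a noisy cleanup job doesn't mask a separate malicious actor.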
7) Make alerts actionable: add context + a runbook snippet
Every alert should include a tiny “what to do now” section.
Example for “New admin granted”:
Immediate actions
- Verify change ticket / deployment record
- Confirm actor identity and auth method
- If unexpected: disable the credential/session and revoke the role
- Search for follow-on actions by the same actor in the next 15 minutes
- Open incident if any Tier 0 follow-up occurred (logging change, firewall open, key creation)
This one addition turns alerts from noise into response.
8) Real examples (what audit logs catch in the real world)
Example 1: “We got billed $8k overnight”
Audit logs reveal:
- A new compute resource created at 2 AM
- Region differs from normal
- Actor is an access key that hasn’t been used in months
Response:
- Disable the key
- Terminate the resources
- Backtrack all actions from that actor
Without audit logs, you’d be guessing.
Example 2: “Database became publicly accessible”
Audit logs show:
- Security rule changed to allow inbound from anywhere
- Change was made by a CI role used outside normal pipeline hours
Response:
- Revert rule
- Investigate CI credential misuse
- Add alert + policy guardrail to block future changes
Example 3: “We can’t find logs for the incident window”
Audit logs show:
- Logging was disabled for 12 minutes
- Log sink permissions were modified
Response:
- Treat as high-severity security incident
- Lock down log archive permissions
- Add “logging disabled” as critical alert
9) The most common mistakes (and how to avoid them)
Mistake 1: Only logging in one account/project
Fix: centralize across all environments, especially prod and identity.
Mistake 2: Keeping logs but not searching them
Fix: hot retention must be indexed and queryable.
Mistake 3: No alerts for disabling logging
Fix: alert on changes to logging configuration and retention.
Mistake 4: “Everything alerts all the time”
Fix: Tier your alerts and focus on Tier 0 first.
Mistake 5: No ownership
Fix: route alerts to teams who can act, not a generic inbox.
10) A simple “starter implementation” you can follow this week
Day 1: Enable + centralize
- Turn on cloud audit logs across prod accounts/projects
- Export/stream them to a log archive environment
Day 2: Lock down integrity
- Restrict deletion and retention changes
- Alert on log configuration changes
Day 3: Implement the Tier 0 alert pack
Start with:
- New admin granted
- Key created
- MFA disabled
- Logging disabled/modified
- Firewall opened broadly
- Storage made public
- Encryption disabled/key policy changed
Day 4: Test your alerts (yes, actually test)
Perform safe test actions in a sandbox:
- Create a test user
- Assign a test role
- Modify a test firewall rule
Confirm alerts fire with the right context.
Day 5: Add a weekly review
A 30-minute weekly routine:
- Top 10 risky actions
- Any anomalies
- Any unowned resources/identities
That’s how audit logging becomes a living system.
11) Quick checklist: “Audit logging done right”
Coverage
- Identity/IAM events
- Resource lifecycle events
- Network/perimeter events
- KMS/secrets events
- Logging configuration events
- Data access logs for crown jewels
Security
- Central log archive
- Least privilege access to logs
- Delete protection / tamper resistance
- Alerts on logging changes
Retention
- Hot searchable window (30–90 days)
- Warm medium-term (up to 12 months)
- Cold archive (years as needed)
- Longer retention for Tier 0
Alerting
- Tier 0 high-signal alerts implemented
- Alerts include actor + target + change + next steps
- Ownership and routing defined
- Alerts tested at least once
Final thought (the part most teams miss)
Audit logging isn’t about collecting evidence for “one day.”
It’s about creating a reality where:
- risky actions are visible,
- suspicious actions are loud,
- and mistakes are reversible before they become disasters.
Below is a cloud-specific, engineer-friendly “Tier 0” audit logging + alert pack for AWS and Azure, plus a dashboard layout and a retention plan you can implement with confidence. I’ll keep it practical, step-by-step, and packed with examples.
AWS + Azure Cloud Audit Logging (Tier 0): What to log, retention, and alert pack
The goal (what “good” looks like)
Within minutes of any risky change, you should be able to answer:
- Who did it (identity, role, service principal)
- What changed (policy, firewall, key, logging)
- Where it happened (account/subscription, region)
- How it happened (MFA? access key? workload identity? CI/CD?)
- What else they did next (follow-on actions)
If you can do that reliably, your audit logging is doing real work.
Part 1 — What to log (AWS vs Azure)
AWS: required audit log sources
A) CloudTrail management events (Tier 0 foundation)
These are control plane changes: IAM, EC2, VPC, S3 policy changes, etc.
Log these always:
- CloudTrail Management Events (Read can be optional; Write is mandatory)
- CloudTrail for all regions
- Prefer organization-level trails (all accounts)
B) CloudTrail data events (enable selectively for crown jewels)
Data events can get noisy/costly, so enable them for sensitive resources:
- S3 object-level access (GetObject/PutObject/DeleteObject)
- Lambda Invoke (if needed)
- DynamoDB data plane (as needed)
C) AWS Config (configuration history + drift)
Config gives you “state change over time.” CloudTrail gives “who did it.”
You want both.
D) VPC Flow Logs (not audit, but critical companion)
Use it when you investigate exfiltration or weird connections.
E) EKS audit logs (if Kubernetes runs your workloads)
Kubernetes API actions (RBAC changes, secret reads, etc.) can be its own security story.
Azure: required audit log sources
A) Azure Activity Log (Tier 0 foundation)
This is Azure’s control plane log: resource changes and administrative actions.
Log these always:
- Activity Log for all subscriptions
- Export/stream to centralized analytics + archive
B) Microsoft Entra ID (Azure AD) logs (identity is Tier 0)
You want:
- Sign-in logs
- Audit logs (directory changes, app/service principal changes)
Identity is the most common starting point for incidents.
C) Azure Resource Graph / change history (optional but helpful)
For “what changed where” queries.
D) Azure Policy events (policy changes + compliance drift)
Because policy changes can create “silent permission expansion.”
E) NSG flow logs (companion to investigate network anomalies)
Good for confirming exposure or exfil patterns.
Part 2 — Centralize logs (the “log archive account” pattern)
AWS recommended pattern
- Create a dedicated Log Archive account (separate from workloads)
- Send CloudTrail logs to an S3 bucket in Log Archive
- Lock it down so workload accounts can’t delete or reduce retention
- Send to your SIEM/search layer for hot queries
Azure recommended pattern
- Use a centralized Log Analytics workspace (security workspace)
- Export Activity Logs + Entra logs into it
- Configure long-term archive to storage (immutable policies if possible)
- Lock down who can change diagnostics/export settings
Why this matters: an attacker who compromises prod admin will try to delete evidence. Your design must assume that.
Part 3 — Tier 0 “What to log” list (AWS + Azure mapping)
Below are the highest-impact audit events you should log and alert on immediately.
1) Identity & privilege escalation (critical)
AWS events to alert on
- New IAM user created
- New access keys created
- Policy attached/detached to a user/role
- Role trust policy changed (who can assume role)
- MFA disabled or removed
- Root account activity (especially access key creation, login)
Example scenario
A developer role suddenly gains AdministratorAccess at 1:48 AM.
That’s not “ops.” That’s an incident until proven otherwise.
Azure events to alert on
- Role assignments created/updated (RBAC)
- Privileged role assignments (Owner, Contributor, User Access Administrator)
- Service principal credentials added (client secret/cert)
- New enterprise app or app consent granted with broad permissions
- Conditional Access policies modified
- MFA methods removed / security info changed (if tracked)
Example scenario
A service principal gets Owner role on a subscription.
Even if intended, this should alert and create an approval trail.
2) “Covering tracks” changes (critical)
These are the highest-signal alerts in any cloud.
AWS
- CloudTrail stopped, deleted, updated
- S3 bucket policy changed for the CloudTrail log bucket
- CloudTrail log file validation disabled
- KMS key policy changes protecting log bucket
- Config recorder disabled
Azure
- Diagnostic settings removed/changed (Activity Log export disabled)
- Log Analytics workspace retention reduced
- Storage account used for archive deleted or access reduced
- Sentinel/SIEM connectors disabled (if using)
Example scenario
Logging disabled for 6 minutes during a suspicious admin session.
Treat that as “active intrusion” until you know otherwise.
3) Network exposure (critical)
AWS
- Security group inbound opened broadly (0.0.0.0/0 or ::/0)
- NACL rules loosened
- Internet Gateway attached
- Route table changes that alter egress path
- New public load balancer listener
- VPC peering changes
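As a worked example for the first bullet, here is a hedged sketch that flags a CloudTrail `AuthorizeSecurityGroupIngress` event opening a rule to the world. The nested layout follows CloudTrail's `requestParameters` shape for this API, but verify it against your own events before relying on it.

```python
WORLD_CIDRS = {"0.0.0.0/0", "::/0"}

def sg_opened_to_world(event: dict) -> bool:
    """True if this CloudTrail event opens a security group to any IP."""
    if event.get("eventName") != "AuthorizeSecurityGroupIngress":
        return False
    params = event.get("requestParameters", {})
    for perm in params.get("ipPermissions", {}).get("items", []):
        ranges = perm.get("ipRanges", {}).get("items", [])
        ranges += perm.get("ipv6Ranges", {}).get("items", [])
        for r in ranges:
            if r.get("cidrIp") in WORLD_CIDRS or r.get("cidrIpv6") in WORLD_CIDRS:
                return True
    return False
```

Combine this with a port-severity lookup so SSH/RDP exposures page immediately while app-port exposures file a ticket.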
Azure
- NSG inbound rule opened broadly
- Public IP created/attached to sensitive resources
- Firewall rules changed (Azure Firewall, app gateways)
- Route table changes, peering changes
- Key management ports exposed
Example scenario
Port 22 or 3389 opened to the world on any resource in prod.
Immediate action required.
4) Data exposure (critical for crown jewels)
AWS
- S3 bucket policy changed to public access
- Public access block disabled
- KMS encryption removed from bucket
- RDS public accessibility enabled
- Snapshot shared publicly or with unexpected account
Azure
- Storage container made public
- SAS tokens created with broad permissions (if logged/trackable)
- Key Vault access policies expanded
- SQL firewall rules opened broadly / public endpoint enabled
Example scenario
A storage bucket/container becomes public “by mistake.”
You want an alert within seconds—not after someone finds it on the internet.
Part 4 — Tier 0 Alert Pack (ready to implement)
I’ll give you a minimum set of alerts that provides maximum security coverage with low noise.
AWS Tier 0 Alerts (15 high-signal rules)
- Root account sign-in (any root login)
- Root access key created
- CloudTrail stop/delete/update
- Config disabled (recorder/aggregator stopped)
- IAM user created (outside automation allowlist)
- Access keys created (especially for privileged users)
- AdministratorAccess attached (or equivalent policy)
- Role trust policy changed (AssumeRole policy updated)
- MFA device removed/deactivated
- S3 bucket made public or PublicAccessBlock disabled
- KMS key policy changed (especially logs + secrets keys)
- Security group opened to world on sensitive ports
- RDS made publicly accessible
- Unusual region activity (write actions in a region you don’t use)
- Mass deletion burst (lots of deletes in short window)
Practical filter tip: Maintain an allowlist of known automation roles (Terraform/CI) and still alert if they do Tier 0 actions outside change windows.
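The filter tip above can be sketched as a small routing function: known automation roles are suppressed only inside approved change windows, and everything else on a Tier 0 action alerts. The role names and window hours are assumptions for illustration.

```python
# Assumed allowlist and change window; adapt to your org.
AUTOMATION_ROLES = {"terraform-apply", "ci-deployer"}
CHANGE_WINDOW_HOURS = range(9, 18)  # 09:00-17:59 local time

def should_alert(actor_role: str, action_tier: int, hour_of_day: int) -> bool:
    """Decide whether a Tier 0 action should page."""
    if action_tier != 0:
        return False  # this filter only routes Tier 0 actions
    in_window = hour_of_day in CHANGE_WINDOW_HOURS
    if actor_role in AUTOMATION_ROLES and in_window:
        return False  # known automation during an approved window
    return True       # humans, unknown actors, or out-of-window automation
```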
Azure Tier 0 Alerts (15 high-signal rules)
- Owner role assignment created at subscription/resource group scope
- User Access Administrator role assigned (RBAC management power)
- New service principal created or app registration created
- Credentials added to service principal (secret/cert)
- Conditional Access policy changed
- Entra ID admin role assigned (privileged directory roles)
- Activity Log export/diagnostic settings disabled
- Log Analytics retention reduced / workspace settings changed
- Storage container set to public / access level changed
- Key Vault access policy expanded / RBAC permission expanded
- NSG rule opened to world on sensitive ports
- Public IP attached to sensitive workloads
- SQL firewall opened broadly / public endpoint enabled
- Unusual location sign-in for privileged identities
- Sign-in risk / multiple failed logins burst (identity attack pattern)
Part 5 — Make alerts actionable (the template that stops alert fatigue)
Every alert should include:
A) What changed
Example: “Security group inbound rule updated: added 0.0.0.0/0 on port 22”
B) Who did it
Actor identity + role + auth type (MFA? key? service principal?)
C) Where
Account/subscription + region + resource name
D) Why this matters (one line)
“Exposes remote administration to the internet.”
E) What to do now (3 steps)
- Revert change (or isolate resource)
- Validate actor is legitimate (ticket, pipeline, approval)
- Search for follow-on actions by same actor for next 15 minutes
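The A-E template is worth enforcing in code so every alert ships the same context. A small formatter like the sketch below (field names are illustrative) makes it hard to emit an alert without the "what to do now" section.

```python
def format_alert(what: str, who: str, where: str, why: str,
                 steps: list) -> str:
    """Render an alert following the A-E template: what/who/where/why/next."""
    lines = [
        f"WHAT CHANGED: {what}",
        f"WHO: {who}",
        f"WHERE: {where}",
        f"WHY IT MATTERS: {why}",
        "WHAT TO DO NOW:",
    ]
    lines += [f"  {i}. {step}" for i, step in enumerate(steps, 1)]
    return "\n".join(lines)
```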
This prevents the classic “alert fired… nobody knew what to do… so it got ignored.”
Part 6 — Retention (AWS + Azure) you can use without guessing
A practical retention plan (works for most orgs)
Hot (fast searchable)
- 90 days for all Tier 0 + Tier 1 audit logs
Warm (searchable but cheaper)
- 12 months for management/control plane logs (CloudTrail management / Azure Activity)
Cold (archive)
- 3–7 years for Tier 0 events and identity logs (Entra audit/sign-in, root activity, RBAC changes)
- Keep longer if your industry requires it
How to control cost without losing security
- Keep Tier 0 fully searchable for 90 days
- Move everything else older than 90 days to cheaper storage/index tiers
- Enable data access logs only for crown jewels (S3/Storage/Key Vault/etc.)
- Reduce noise by filtering high-frequency read events unless required
Part 7 — Dashboard layout (what security + engineering will actually use)
Create one dashboard per cloud, but keep the same sections so teams learn it once.
Section 1: “Today’s Risk”
- Count of Tier 0 events today
- Highest severity events (top 10)
- Logging tamper events count
Section 2: Identity
- New privileged assignments (last 24h)
- New keys/secrets created (last 24h)
- Unusual sign-ins (geo, device, IP)
Section 3: Perimeter changes
- New public endpoints
- Firewall/SG/NSG rules opened broadly
- Route/gateway changes
Section 4: Data exposure
- Public storage changes
- Encryption/key policy changes
- DB public endpoint changes
Section 5: “Top Actors”
- Top identities performing changes (last 24h)
- New actors never seen before (last 7 days)
Section 6: “Unallocated/Unknown”
- Events from identities with no clear owner tag/label
- Resources modified without ownership metadata
This dashboard makes investigations fast and keeps your team curious because it constantly answers: “what changed, and should I care?”
Part 8 — The “Starter Plan” (5 days to a working system)
Day 1: Enable Tier 0 sources
- AWS: org-level CloudTrail management events (all regions)
- Azure: Activity Logs + Entra sign-in & audit logs
Day 2: Centralize + lock down
- AWS: Log Archive account + protected S3 bucket + restricted delete
- Azure: Central Log Analytics + archive to storage + restricted settings change
Day 3: Implement Tier 0 alerts (start with 8)
AWS first 8:
- root login, root access key created, CloudTrail changed, admin policy attached, access key created, SG opened broadly, S3 made public, unusual region activity
Azure first 8:
- Owner role assigned, SP credential added, diagnostics disabled, NSG open, storage public, Key Vault policy expanded, unusual sign-in, SQL firewall broad
Day 4: Test alerts
Do safe test actions in a sandbox subscription/account and confirm:
- alert fires
- contains actor + target + change + next steps
Day 5: Add weekly review routine
30 minutes:
- review Tier 0 list
- close false positives by allowlisting known automation identities
- create 1–3 backlog items to reduce recurring risk