Audience: Platform Teams, Data Platform, SRE

Data incidents need the same rigor as application incidents. This guide helps you set up 24/7 monitoring with proper escalation, on-call routing, and incident response.

The Goal

On-call data alerting flow from detection to resolution

Architecture Overview

Alerting architecture showing event routing to different destinations

Setting Up PagerDuty Integration

Step 1: Create PagerDuty Service

In PagerDuty:
  1. Go to Services → New Service
  2. Name: Data Observability - AnomalyArmor
  3. Integration: Select Events API V2
  4. Copy the Integration Key

Step 2: Add PagerDuty Destination in AnomalyArmor

  1. Go to Alerts → Destinations
  2. Click Add Destination
  3. Select PagerDuty
  4. Enter the Integration Key
  5. Name: PagerDuty - Data On-Call
  6. Test and Save
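Before wiring up rules, you can verify the Integration Key end to end by sending a test event directly to PagerDuty's Events API v2. A minimal sketch using only the Python standard library; replace YOUR_INTEGRATION_KEY with the key from Step 1 (the helper names are our own, not part of any SDK):

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build an Events API v2 'trigger' payload.

    severity must be one of: critical, error, warning, info.
    """
    event = {
        "routing_key": routing_key,    # the Integration Key from Step 1
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    if dedup_key:
        event["dedup_key"] = dedup_key  # lets a later "resolve" event close this alert
    return event

def send_event(event):
    """POST the event to the Events API v2 endpoint and return the response body."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage:
#   evt = build_trigger_event("YOUR_INTEGRATION_KEY",
#                             "Test alert from AnomalyArmor setup", "anomalyarmor")
#   send_event(evt)
```

A successful trigger should page whoever is on call for the service, which also exercises the escalation policy you configure next.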

Step 3: Configure Escalation Policy

In PagerDuty, set up an escalation policy for the service.

Escalation policy levels

Alert Urgency Framework

Define how urgently different data incidents need response:

Critical (Page Immediately)

Criteria:
  • Production data pipeline completely down
  • Core revenue tables missing or stale >4 hours
  • Discovery failures for >24 hours
Examples:
  • Column removed from orders table
  • payments table data >4 hours stale
  • Can't connect to production database
Destination: PagerDuty → On-Call

High (Respond Within 4 Hours)

Criteria:
  • Important tables stale (1-4 hours)
  • Schema changes in production
  • Non-critical discovery failures
Examples:
  • Column type changed in production
  • Analytics tables 2 hours stale
  • Staging discovery failed
Destination: Slack #data-incidents

Medium (Respond Within 24 Hours)

Criteria:
  • Non-production schema changes
  • Warning thresholds reached
  • New assets discovered
Examples:
  • Staging schema changed
  • Freshness approaching SLA (warning)
  • New table discovered in production
Destination: Slack #data-alerts

Low (Informational)

Criteria:
  • Development changes
  • Expected changes
  • Routine discoveries
Destination: Email digest (daily)
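The four urgency tiers above map directly to destinations and response SLAs. A sketch of that mapping as code (the destination identifiers are illustrative, not AnomalyArmor syntax):

```python
# Routing table mirroring the urgency framework above.
URGENCY_ROUTING = {
    "critical": ["pagerduty:data-oncall", "slack:#data-incidents"],
    "high":     ["slack:#data-incidents"],
    "medium":   ["slack:#data-alerts"],
    "low":      ["email:daily-digest"],
}

# Response SLA per tier, in hours (0 = page immediately, None = informational).
RESPONSE_SLA_HOURS = {"critical": 0, "high": 4, "medium": 24, "low": None}

def route(urgency):
    """Return the destinations for an urgency tier; fail loudly on unknown tiers."""
    try:
        return URGENCY_ROUTING[urgency]
    except KeyError:
        # Better to error out than to silently drop an alert.
        raise ValueError(f"unknown urgency {urgency!r}")
```

Failing loudly on an unknown tier is deliberate: a typo in an urgency label should surface during testing, not swallow a critical page.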

Alert Rule Configuration

Rule 1: Critical - Production Breaking Changes

  • Name: CRITICAL - Production Breaking Changes
  • Event: Schema Change Detected
  • Data source: production-*
  • Schema: public, analytics
  • Change type: Column Removed, Table Removed
  • Destinations: PagerDuty (Data On-Call), Slack #data-incidents

Rule 2: Critical - Revenue Table Freshness

  • Name: CRITICAL - Revenue Data Stale
  • Event: Freshness Violation
  • Assets: orders, payments, revenue_*
  • SLA exceeded by: >4 hours
  • Destinations: PagerDuty (Data On-Call), Slack #data-incidents

Rule 3: High - Production Schema Changes

  • Name: HIGH - Production Schema Changes
  • Event: Schema Change Detected
  • Data source: production-*
  • Change type: All
  • Destinations: Slack #data-incidents

Rule 4: High - Data Freshness Violations

  • Name: HIGH - Data Freshness Violations
  • Event: Freshness Violation
  • Data source: production-*
  • Condition: SLA exceeded
  • Destinations: Slack #data-incidents

Rule 5: High - Discovery Failures

  • Name: HIGH - Discovery Failures
  • Event: Discovery Failed
  • Data source: production-*
  • Destinations: Slack #data-incidents, Email data-platform@company.com
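The five rules above are configured in the AnomalyArmor UI. As a mental model of how such rules match events, here is an illustrative sketch (the matching logic and field names are assumptions, not AnomalyArmor's implementation; `fnmatch` handles the `production-*` wildcard):

```python
from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass
class AlertRule:
    """Hypothetical in-code mirror of a UI-configured alert rule."""
    name: str
    event: str
    source_pattern: str = "*"
    change_types: tuple = ()      # empty tuple = match all change types
    destinations: tuple = ()

    def matches(self, event, source, change_type=""):
        if event != self.event:
            return False
        if not fnmatch(source, self.source_pattern):   # e.g. "production-*"
            return False
        if self.change_types and change_type not in self.change_types:
            return False
        return True

RULES = [
    AlertRule(
        name="CRITICAL - Production Breaking Changes",
        event="Schema Change Detected",
        source_pattern="production-*",
        change_types=("Column Removed", "Table Removed"),
        destinations=("pagerduty:data-oncall", "slack:#data-incidents"),
    ),
    AlertRule(
        name="HIGH - Production Schema Changes",
        event="Schema Change Detected",
        source_pattern="production-*",
        destinations=("slack:#data-incidents",),
    ),
]

def first_match(event, source, change_type=""):
    """Return the first rule that matches, so specific rules should be listed first."""
    for rule in RULES:
        if rule.matches(event, source, change_type):
            return rule
    return None
```

Ordering matters in this sketch: the breaking-change rule precedes the catch-all schema-change rule, so a removed column pages on-call rather than only posting to Slack.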

On-Call Runbook

When Paged for Schema Change

On-call runbook for schema changes

When Paged for Freshness Violation

  1. ACKNOWLEDGE the alert
  2. CHECK ETL STATUS
    • Is the ETL job running? Failed? Stuck?
    • Check Airflow/Dagster/orchestrator
  3. CHECK SOURCE SYSTEM
    • Is the source database accessible?
    • Is source data actually updating?
  4. IDENTIFY ROOT CAUSE
    • ETL failure → Fix and restart
    • Source delay → Communicate delay
    • Connection issue → Troubleshoot connection
  5. MITIGATE
    • Restart failed jobs
    • Notify stakeholders of delay
  6. RESOLVE and document
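The triage above starts from knowing how stale a table is relative to its SLA. A minimal sketch of that classification (the 75% warning band is an illustrative choice, not an AnomalyArmor setting):

```python
from datetime import datetime, timedelta, timezone

def freshness_status(last_loaded_at, sla, now=None):
    """Classify a table against its freshness SLA.

    Returns 'ok', 'warning' (inside the last 25% of the SLA window),
    or 'violation' (SLA exceeded).
    """
    now = now or datetime.now(timezone.utc)
    age = now - last_loaded_at
    if age > sla:
        return "violation"
    if age > sla * 0.75:
        return "warning"
    return "ok"
```

The warning band is what feeds the Medium-urgency "Freshness approaching SLA" alerts, giving on-call a chance to act before a page fires.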

Slack Integration Best Practices

Channel Setup

Slack Channels:
  • #data-incidents - Breaking changes (notifications on)
  • #data-alerts - All schema changes (lower priority)
  • #data-digest - Daily/weekly summaries

Alert Message Format

AnomalyArmor alerts include:
🔴 CRITICAL: Schema Change Detected

Asset: production.public.orders
Change: Column removed - shipping_status (varchar)

Detected: Today at 3:15 PM UTC
Discovery Run: #12345

Impact: High - This table is used by 5 downstream models

Actions:
β€’ [View in AnomalyArmor]
β€’ [View Asset Details]
β€’ [View Downstream Dependencies]

On-Call: @data-oncall
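If you also push alerts to Slack from your own tooling, a minimal sketch that mirrors the message format above via a standard Slack incoming webhook (the emoji/severity mapping and helper names are illustrative assumptions):

```python
import json
import urllib.request

SEVERITY_ICONS = {"critical": "🔴", "high": "🟠", "medium": "🟡", "low": "⚪"}

def build_slack_message(severity, asset, change):
    """Build a webhook payload matching the alert format shown above."""
    icon = SEVERITY_ICONS.get(severity, "⚪")
    return {
        "text": (
            f"{icon} {severity.upper()}: Schema Change Detected\n"
            f"Asset: {asset}\n"
            f"Change: {change}"
        )
    }

def post_slack_alert(webhook_url, message):
    """POST the message to a Slack incoming webhook URL."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Usage:
#   msg = build_slack_message("critical", "production.public.orders",
#                             "Column removed - shipping_status (varchar)")
#   post_slack_alert("https://hooks.slack.com/services/...", msg)
```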

Maintenance Windows

Scheduled Maintenance

Before planned changes:
  1. Go to Alerts → Rules
  2. Toggle OFF relevant rules
  3. Set a reminder to re-enable (e.g., calendar event)
  4. Proceed with maintenance
  5. Verify changes detected correctly
  6. Toggle rules back ON
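If you drive suppression from a script rather than the UI toggle, a minimal sketch of a maintenance-window check (the window format is hypothetical; AnomalyArmor itself only exposes the manual toggle described above):

```python
from datetime import datetime

def alerts_suppressed(now, windows):
    """Return True if `now` falls inside any planned maintenance window.

    windows: list of (start, end) datetime pairs; end is exclusive.
    """
    return any(start <= now < end for start, end in windows)

# Usage: check before forwarding an alert to PagerDuty/Slack.
#   if not alerts_suppressed(datetime.utcnow(), MAINTENANCE_WINDOWS):
#       forward(alert)
```

Keeping windows as explicit start/end pairs means a forgotten re-enable expires on its own, unlike a manual toggle.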

Quick Disable

For unexpected but known issues, quickly disable a rule:
  1. Go to Alerts → Rules
  2. Find the rule
  3. Toggle it OFF
  4. Remember to re-enable when the issue is resolved

Metrics to Track

  • MTTD (Time to Detect): target < 1 hour, driven by discovery frequency
  • MTTN (Time to Notify): target < 5 min, measured as alert → PagerDuty time
  • MTTR (Time to Resolve): target < 4 hours, measured as alert → resolution time
  • False Positive Rate: target < 20%, measured as alerts ignored / total alerts
  • Pager Load: target < 5/week, measured as critical alerts per week

Review these weekly in your on-call handoff.
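These metrics fall out of incident timestamps. A sketch of the weekly computation, assuming each incident record carries `detected_at`, `notified_at`, `resolved_at`, and an `acknowledged` flag (hypothetical field names):

```python
from statistics import mean

def weekly_oncall_metrics(incidents):
    """Compute MTTN, MTTR, false-positive rate, and pager load for one week.

    incidents: list of dicts with detected_at / notified_at / resolved_at
    datetimes and an 'acknowledged' bool.
    """
    mttn = mean((i["notified_at"] - i["detected_at"]).total_seconds()
                for i in incidents) / 60       # minutes
    mttr = mean((i["resolved_at"] - i["detected_at"]).total_seconds()
                for i in incidents) / 3600     # hours
    ignored = sum(1 for i in incidents if not i["acknowledged"])
    return {
        "mttn_minutes": round(mttn, 1),
        "mttr_hours": round(mttr, 2),
        "false_positive_rate": ignored / len(incidents),
        "pager_load": len(incidents),
    }
```

Treating unacknowledged alerts as false positives is a simplification; refine it if your team acknowledges everything by policy.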

Checklist

Before going live with on-call alerting:
  • PagerDuty integration configured
  • Escalation policy set up
  • Critical/High/Medium/Low rules defined
  • Slack channels created and configured
  • On-call runbook documented
  • Team trained on response procedures
  • Test alert sent and verified

PagerDuty Setup

Detailed PagerDuty integration guide

Alert Best Practices

Reduce alert fatigue