The Goal
Architecture Overview
Setting Up PagerDuty Integration
Step 1: Create PagerDuty Service
In PagerDuty:
- Go to Services → New Service
- Name: Data Observability - AnomalyArmor
- Integration: Select Events API V2
- Copy the Integration Key
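The destination you add in the next step posts trigger events to PagerDuty's Events API v2 using this Integration Key as the `routing_key`. A minimal sketch of what such an event looks like (the key and summary values are placeholders, not real credentials):

```python
import json

# Events API v2 "trigger" event; routing_key is the Integration Key
# copied from the PagerDuty service (placeholder value below).
def build_trigger_event(routing_key, summary, source, severity="critical"):
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # shown as the incident title
            "source": source,      # origin of the event, e.g. the monitoring tool
            "severity": severity,  # critical | error | warning | info
        },
    }

event = build_trigger_event(
    "YOUR_INTEGRATION_KEY",
    "CRITICAL - Revenue Data Stale: orders >4h past SLA",
    "anomalyarmor",
)
print(json.dumps(event, indent=2))
```

Delivering the event is a POST to `https://events.pagerduty.com/v2/enqueue`; a 202 response means PagerDuty accepted it and will open or deduplicate an incident.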
Step 2: Add PagerDuty Destination in AnomalyArmor
- Go to Alerts → Destinations
- Click Add Destination
- Select PagerDuty
- Enter the Integration Key
- Name: PagerDuty - Data On-Call
- Test and Save
Step 3: Configure Escalation Policy
In PagerDuty, configure the escalation policy for this service.
Alert Urgency Framework
Define how urgently different data incidents need a response:
Critical (Page Immediately)
Criteria:
- Production data pipeline completely down
- Core revenue tables missing or stale >4 hours
- Discovery failures for >24 hours
- Column removed from `orders` table
- `payments` table data >4 hours stale
- Can't connect to production database
High (Respond Within 4 Hours)
Criteria:
- Important tables stale (1-4 hours)
- Schema changes in production
- Non-critical discovery failures
- Column type changed in production
- Analytics tables 2 hours stale
- Staging discovery failed
Medium (Respond Within 24 Hours)
Criteria:
- Non-production schema changes
- Warning thresholds reached
- New assets discovered
- Staging schema changed
- Freshness approaching SLA (warning)
- New table discovered in production
Low (Informational)
Criteria:
- Development changes
- Expected changes
- Routine discoveries
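The four tiers above can be expressed as a simple triage function. This is an illustrative sketch, not AnomalyArmor's actual logic; the event field names (`environment`, `kind`, `stale_hours`) are assumptions:

```python
# Urgency tier -> response expectation, per the framework above.
RESPONSE_SLA = {
    "critical": "page immediately",
    "high": "respond within 4 hours",
    "medium": "respond within 24 hours",
    "low": "informational",
}

def classify(event):
    """Illustrative triage: event is a dict with 'environment',
    'kind', and optional 'stale_hours' (field names are assumptions)."""
    env, kind = event.get("environment"), event.get("kind")
    if env == "production":
        if kind in ("pipeline_down", "column_removed", "connection_lost"):
            return "critical"
        if kind == "freshness" and event.get("stale_hours", 0) > 4:
            return "critical"
        if kind in ("schema_change", "discovery_failed", "freshness"):
            return "high"
        if kind == "new_table":
            return "medium"
    if env == "staging":
        return "medium"
    return "low"

print(classify({"environment": "production", "kind": "freshness", "stale_hours": 6}))  # critical
```

Note how the same event kind (a freshness violation) lands in different tiers depending on how far past the SLA it is, which mirrors the Critical vs High criteria above.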
Alert Rule Configuration
Rule 1: Critical - Production Breaking Changes
| Field | Value |
|---|---|
| Name | CRITICAL - Production Breaking Changes |
| Event | Schema Change Detected |
| Data source | production-* |
| Schema | public, analytics |
| Change type | Column Removed, Table Removed |
| Destinations | PagerDuty (Data On-Call), Slack #data-incidents |
Rule 2: Critical - Revenue Table Freshness
| Field | Value |
|---|---|
| Name | CRITICAL - Revenue Data Stale |
| Event | Freshness Violation |
| Assets | orders, payments, revenue_* |
| SLA exceeded by | >4 hours |
| Destinations | PagerDuty (Data On-Call), Slack #data-incidents |
Rule 3: High - Production Schema Changes
| Field | Value |
|---|---|
| Name | Production Schema Changes |
| Event | Schema Change Detected |
| Data source | production-* |
| Change type | All |
| Destinations | Slack #data-incidents |
Rule 4: High - Data Freshness Violations
| Field | Value |
|---|---|
| Name | HIGH - Data Freshness Violations |
| Event | Freshness Violation |
| Data source | production-* |
| Condition | SLA exceeded |
| Destinations | Slack #data-incidents |
Rule 5: High - Discovery Failures
| Field | Value |
|---|---|
| Name | HIGH - Discovery Failures |
| Event | Discovery Failed |
| Data source | production-* |
| Destinations | Slack #data-incidents, Email data-platform@company.com |
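Each rule above pairs an event type with wildcard data-source filters like `production-*`. How such a filter narrows an incoming event can be sketched with shell-style globs (AnomalyArmor's internal matching is not documented here; this is an illustration):

```python
from fnmatch import fnmatch

# Rule 1 expressed as data, mirroring the table above.
RULE_1 = {
    "name": "CRITICAL - Production Breaking Changes",
    "event": "Schema Change Detected",
    "data_source": "production-*",
    "schemas": {"public", "analytics"},
    "change_types": {"Column Removed", "Table Removed"},
}

def matches(rule, event):
    """Return True if an event dict satisfies every filter on the rule."""
    return (
        event["event"] == rule["event"]
        and fnmatch(event["data_source"], rule["data_source"])
        and event["schema"] in rule["schemas"]
        and event["change_type"] in rule["change_types"]
    )

event = {
    "event": "Schema Change Detected",
    "data_source": "production-postgres",
    "schema": "public",
    "change_type": "Column Removed",
}
print(matches(RULE_1, event))  # True
```

The same event against a `staging-*` source would fall through to the broader, lower-urgency rules, which is what keeps the pager quiet for non-production changes.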
On-Call Runbook
When Paged for Schema Change
When Paged for Freshness Violation
- ACKNOWLEDGE the alert
- CHECK ETL STATUS
  - Is the ETL job running? Failed? Stuck?
  - Check Airflow/Dagster/orchestrator
- CHECK SOURCE SYSTEM
  - Is the source database accessible?
  - Is source data actually updating?
- IDENTIFY ROOT CAUSE
  - ETL failure → Fix and restart
  - Source delay → Communicate the delay
  - Connection issue → Troubleshoot the connection
- MITIGATE
  - Restart failed jobs
  - Notify stakeholders of the delay
- RESOLVE and document
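The "is source data actually updating?" check amounts to comparing a table's last load time against its SLA. A minimal sketch (the four-hour SLA mirrors the rules above; in practice the last-loaded timestamp would come from your warehouse, e.g. `max(updated_at)`):

```python
from datetime import datetime, timedelta, timezone

def staleness(last_loaded_at, sla_hours):
    """Return (hours_past_sla, in_violation) for a table's last load time."""
    age = datetime.now(timezone.utc) - last_loaded_at
    hours_over = (age - timedelta(hours=sla_hours)).total_seconds() / 3600
    return max(hours_over, 0.0), hours_over > 0

# Example: orders last loaded 6 hours ago against a 4-hour SLA.
last_loaded = datetime.now(timezone.utc) - timedelta(hours=6)
hours_over, violated = staleness(last_loaded, sla_hours=4)
print(f"orders: {hours_over:.1f}h past SLA, violation={violated}")
```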
Slack Integration Best Practices
Channel Setup
Slack Channels:
- #data-incidents - Breaking changes (notifications on)
- #data-alerts - All schema changes (lower priority)
- #data-digest - Daily/weekly summaries
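A breaking-change notification to #data-incidents is delivered through a Slack incoming webhook. A sketch of the message body (channel routing is configured on the webhook itself; the asset and detail strings here are illustrative):

```python
import json

def slack_alert(rule_name, asset, detail):
    """Build a Slack incoming-webhook payload using Block Kit sections."""
    return {
        "text": f"{rule_name}: {asset}",  # plain-text fallback for notifications
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*{rule_name}*\n`{asset}` - {detail}",
                },
            },
        ],
    }

payload = slack_alert(
    "CRITICAL - Production Breaking Changes",
    "public.orders",
    "Column Removed",
)
print(json.dumps(payload, indent=2))
```

POSTing this JSON to the webhook URL renders a bolded rule name with the affected asset underneath; the top-level `text` field is what appears in the push notification.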
Alert Message Format
AnomalyArmor alerts include:
Maintenance Windows
Scheduled Maintenance
Before planned changes:
- Go to Alerts → Rules
- Toggle OFF relevant rules
- Set a reminder to re-enable (e.g., calendar event)
- Proceed with maintenance
- Verify changes detected correctly
- Toggle rules back ON
Quick Disable
For unexpected but known issues, quickly disable a rule:
- Go to Alerts → Rules
- Find the rule
- Toggle it OFF
- Remember to re-enable when the issue is resolved
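Both workflows above boil down to a suppression check before an alert fires. In AnomalyArmor that check is the rule toggle; as a sketch of the underlying idea (the window representation as `(start, end)` pairs is an assumption):

```python
from datetime import datetime, timezone

def suppressed(now, windows):
    """True if 'now' falls inside any (start, end) maintenance window."""
    return any(start <= now < end for start, end in windows)

# Illustrative window: a 02:00-04:00 UTC planned migration.
windows = [
    (datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc),
     datetime(2024, 6, 1, 4, 0, tzinfo=timezone.utc)),
]
print(suppressed(datetime(2024, 6, 1, 3, 0, tzinfo=timezone.utc), windows))  # True
```

A time-bounded window like this is safer than a bare toggle because it re-enables itself, which is exactly why the manual steps above insist on a reminder.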
Metrics to Track
| Metric | Target | How to Measure |
|---|---|---|
| MTTD (Time to Detect) | < 1 hour | Discovery frequency |
| MTTN (Time to Notify) | < 5 min | Alert → PagerDuty time |
| MTTR (Time to Resolve) | < 4 hours | Alert → Resolution time |
| False Positive Rate | < 20% | Alerts ignored / Total alerts |
| Pager Load | < 5/week | Critical alerts per week |
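These metrics can be computed from exported incident records. A sketch, with field names that are assumptions about your incident export (durations kept in minutes since detection for simplicity):

```python
def mean(xs):
    return sum(xs) / len(xs)

def alert_metrics(incidents):
    """Compute MTTN, MTTR, and false-positive rate from incident records.
    Each record holds minutes from detection to notification/resolution,
    plus an 'ignored' flag for alerts nobody acted on."""
    mttn = mean([i["notified_min"] for i in incidents])   # detect -> page
    mttr = mean([i["resolved_min"] for i in incidents])   # detect -> resolve
    ignored = sum(1 for i in incidents if i.get("ignored"))
    return {
        "mttn_min": mttn,
        "mttr_min": mttr,
        "fp_rate": ignored / len(incidents),
    }

incidents = [
    {"notified_min": 3, "resolved_min": 120, "ignored": False},
    {"notified_min": 5, "resolved_min": 240, "ignored": True},
]
print(alert_metrics(incidents))  # {'mttn_min': 4.0, 'mttr_min': 180.0, 'fp_rate': 0.5}
```

In this toy sample the false-positive rate is already at 50%, well above the 20% target, which is the signal to tighten rule filters rather than to add more destinations.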
Checklist
Before going live with on-call alerting:
- PagerDuty integration configured
- Escalation policy set up
- Critical/High/Medium/Low rules defined
- Slack channels created and configured
- On-call runbook documented
- Team trained on response procedures
- Test alert sent and verified
Related Resources
PagerDuty Setup
Detailed PagerDuty integration guide
Alert Best Practices
Reduce alert fatigue
