The Goal
Architecture Overview
Setting Up PagerDuty Integration
Step 1: Create PagerDuty Service
In PagerDuty:
- Go to Services → New Service
- Name: Data Observability - AnomalyArmor
- Integration: Select Events API V2
- Copy the Integration Key
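The destination you add in the next step posts trigger events to PagerDuty's Events API v2 using this Integration Key as the `routing_key`. A minimal sketch of what such an event looks like (the key and summary values are placeholders, not real credentials):

```python
import json

# Events API v2 "trigger" event; routing_key is the Integration Key
# copied from the PagerDuty service (placeholder value below).
def build_trigger_event(routing_key, summary, source, severity="critical"):
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # shown as the incident title
            "source": source,      # origin of the event, e.g. the monitoring tool
            "severity": severity,  # critical | error | warning | info
        },
    }

event = build_trigger_event(
    "YOUR_INTEGRATION_KEY",
    "CRITICAL - Revenue Data Stale: orders >4h past SLA",
    "anomalyarmor",
)
print(json.dumps(event, indent=2))
```

Delivering the event is a POST to `https://events.pagerduty.com/v2/enqueue`; a 202 response means PagerDuty accepted it and will open or deduplicate an incident.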
Step 2: Add PagerDuty Destination in AnomalyArmor
- Go to Alerts → Destinations
- Click Add Destination
- Select PagerDuty
- Enter the Integration Key
- Name: PagerDuty - Data On-Call
- Test and Save
Step 3: Configure Escalation Policy
In PagerDuty, configure the escalation policy for this service.
Alert Urgency Framework
Define how urgently different data incidents need a response:
Critical (Page Immediately)
Criteria:
- Production data pipeline completely down
- Core revenue tables missing or stale >4 hours
- Discovery failures for >24 hours
- Column removed from `orders` table
- `payments` table data >4 hours stale
- Can't connect to production database
High (Respond Within 4 Hours)
Criteria:
- Important tables stale (1-4 hours)
- Schema changes in production
- Non-critical discovery failures
- Column type changed in production
- Analytics tables 2 hours stale
- Staging discovery failed
Medium (Respond Within 24 Hours)
Criteria:
- Non-production schema changes
- Warning thresholds reached
- New assets discovered
- Staging schema changed
- Freshness approaching SLA (warning)
- New table discovered in production
Low (Informational)
Criteria:
- Development changes
- Expected changes
- Routine discoveries
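The four tiers above can be expressed as a simple triage function. This is an illustrative sketch, not AnomalyArmor's actual logic; the event field names (`environment`, `kind`, `stale_hours`) are assumptions:

```python
# Urgency tier -> response expectation, per the framework above.
RESPONSE_SLA = {
    "critical": "page immediately",
    "high": "respond within 4 hours",
    "medium": "respond within 24 hours",
    "low": "informational",
}

def classify(event):
    """Illustrative triage: event is a dict with 'environment',
    'kind', and optional 'stale_hours' (field names are assumptions)."""
    env, kind = event.get("environment"), event.get("kind")
    if env == "production":
        if kind in ("pipeline_down", "column_removed", "connection_lost"):
            return "critical"
        if kind == "freshness" and event.get("stale_hours", 0) > 4:
            return "critical"
        if kind in ("schema_change", "discovery_failed", "freshness"):
            return "high"
        if kind == "new_table":
            return "medium"
    if env == "staging":
        return "medium"
    return "low"

print(classify({"environment": "production", "kind": "freshness", "stale_hours": 6}))  # critical
```

Note how the same event kind (a freshness violation) lands in different tiers depending on how far past the SLA it is, which mirrors the Critical vs High criteria above.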
Alert Rule Configuration
Rule 1: Critical - Production Breaking Changes
| Field | Value |
|---|---|
| Name | CRITICAL - Production Breaking Changes |
| Event | Schema Change Detected |
| Data source | production-* |
| Schema | public, analytics |
| Change type | Column Removed, Table Removed |
| Destinations | PagerDuty (Data On-Call), Slack #data-incidents |
Rule 2: Critical - Revenue Table Freshness
| Field | Value |
|---|---|
| Name | CRITICAL - Revenue Data Stale |
| Event | Freshness Violation |
| Assets | orders, payments, revenue_* |
| SLA exceeded by | >4 hours |
| Destinations | PagerDuty (Data On-Call), Slack #data-incidents |
Rule 3: High - Production Schema Changes
| Field | Value |
|---|---|
| Name | Production Schema Changes |
| Event | Schema Change Detected |
| Data source | production-* |
| Change type | All |
| Destinations | Slack #data-incidents |
Rule 4: High - Data Freshness Violations
| Field | Value |
|---|---|
| Name | HIGH - Data Freshness Violations |
| Event | Freshness Violation |
| Data source | production-* |
| Condition | SLA exceeded |
| Destinations | Slack #data-incidents |
Rule 5: High - Discovery Failures
| Field | Value |
|---|---|
| Name | HIGH - Discovery Failures |
| Event | Discovery Failed |
| Data source | production-* |
| Destinations | Slack #data-incidents, Email data-platform@company.com |
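Each rule above pairs an event type with wildcard data-source filters like `production-*`. How such a filter narrows an incoming event can be sketched with shell-style globs (AnomalyArmor's internal matching is not documented here; this is an illustration):

```python
from fnmatch import fnmatch

# Rule 1 expressed as data, mirroring the table above.
RULE_1 = {
    "name": "CRITICAL - Production Breaking Changes",
    "event": "Schema Change Detected",
    "data_source": "production-*",
    "schemas": {"public", "analytics"},
    "change_types": {"Column Removed", "Table Removed"},
}

def matches(rule, event):
    """Return True if an event dict satisfies every filter on the rule."""
    return (
        event["event"] == rule["event"]
        and fnmatch(event["data_source"], rule["data_source"])
        and event["schema"] in rule["schemas"]
        and event["change_type"] in rule["change_types"]
    )

event = {
    "event": "Schema Change Detected",
    "data_source": "production-postgres",
    "schema": "public",
    "change_type": "Column Removed",
}
print(matches(RULE_1, event))  # True
```

The same event against a `staging-*` source would fall through to the broader, lower-urgency rules, which is what keeps the pager quiet for non-production changes.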
On-Call Runbook
When Paged for Schema Change
When Paged for Freshness Violation
- ACKNOWLEDGE the alert
- CHECK ETL STATUS
  - Is the ETL job running? Failed? Stuck?
  - Check Airflow/Dagster/orchestrator
- CHECK SOURCE SYSTEM
  - Is the source database accessible?
  - Is source data actually updating?
- IDENTIFY ROOT CAUSE
  - ETL failure → Fix and restart
  - Source delay → Communicate the delay
  - Connection issue → Troubleshoot the connection
- MITIGATE
  - Restart failed jobs
  - Notify stakeholders of the delay
- RESOLVE and document
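The "is source data actually updating?" check amounts to comparing a table's last load time against its SLA. A minimal sketch (the four-hour SLA mirrors the rules above; in practice the last-loaded timestamp would come from your warehouse, e.g. `max(updated_at)`):

```python
from datetime import datetime, timedelta, timezone

def staleness(last_loaded_at, sla_hours):
    """Return (hours_past_sla, in_violation) for a table's last load time."""
    age = datetime.now(timezone.utc) - last_loaded_at
    hours_over = (age - timedelta(hours=sla_hours)).total_seconds() / 3600
    return max(hours_over, 0.0), hours_over > 0

# Example: orders last loaded 6 hours ago against a 4-hour SLA.
last_loaded = datetime.now(timezone.utc) - timedelta(hours=6)
hours_over, violated = staleness(last_loaded, sla_hours=4)
print(f"orders: {hours_over:.1f}h past SLA, violation={violated}")
```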
Slack Integration Best Practices
Channel Setup
Slack Channels:
- #data-incidents - Breaking changes (notifications on)
- #data-alerts - All schema changes (lower priority)
- #data-digest - Daily/weekly summaries
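A breaking-change notification to #data-incidents is delivered through a Slack incoming webhook. A sketch of the message body (channel routing is configured on the webhook itself; the asset and detail strings here are illustrative):

```python
import json

def slack_alert(rule_name, asset, detail):
    """Build a Slack incoming-webhook payload using Block Kit sections."""
    return {
        "text": f"{rule_name}: {asset}",  # plain-text fallback for notifications
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*{rule_name}*\n`{asset}` - {detail}",
                },
            },
        ],
    }

payload = slack_alert(
    "CRITICAL - Production Breaking Changes",
    "public.orders",
    "Column Removed",
)
print(json.dumps(payload, indent=2))
```

POSTing this JSON to the webhook URL renders a bolded rule name with the affected asset underneath; the top-level `text` field is what appears in the push notification.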
Alert Message Format
AnomalyArmor alerts include:
Maintenance Windows
Scheduled Maintenance
Before planned changes:
- Go to Alerts → Rules
- Toggle OFF relevant rules
- Set a reminder to re-enable (e.g., calendar event)
- Proceed with maintenance
- Verify changes detected correctly
- Toggle rules back ON
Quick Disable
For unexpected but known issues, quickly disable a rule:
- Go to Alerts → Rules
- Find the rule
- Toggle it OFF
- Remember to re-enable when the issue is resolved
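Both workflows above boil down to a suppression check before an alert fires. In AnomalyArmor that check is the rule toggle; as a sketch of the underlying idea (the window representation as `(start, end)` pairs is an assumption):

```python
from datetime import datetime, timezone

def suppressed(now, windows):
    """True if 'now' falls inside any (start, end) maintenance window."""
    return any(start <= now < end for start, end in windows)

# Illustrative window: a 02:00-04:00 UTC planned migration.
windows = [
    (datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc),
     datetime(2024, 6, 1, 4, 0, tzinfo=timezone.utc)),
]
print(suppressed(datetime(2024, 6, 1, 3, 0, tzinfo=timezone.utc), windows))  # True
```

A time-bounded window like this is safer than a bare toggle because it re-enables itself, which is exactly why the manual steps above insist on a reminder.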
Metrics to Track
| Metric | Target | How to Measure |
|---|---|---|
| MTTD (Time to Detect) | < 1 hour | Discovery frequency |
| MTTN (Time to Notify) | < 5 min | Alert → PagerDuty time |
| MTTR (Time to Resolve) | < 4 hours | Alert → Resolution time |
| False Positive Rate | < 20% | Alerts ignored / Total alerts |
| Pager Load | < 5/week | Critical alerts per week |
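These metrics can be computed from exported incident records. A sketch, with field names that are assumptions about your incident export (durations kept in minutes since detection for simplicity):

```python
def mean(xs):
    return sum(xs) / len(xs)

def alert_metrics(incidents):
    """Compute MTTN, MTTR, and false-positive rate from incident records.
    Each record holds minutes from detection to notification/resolution,
    plus an 'ignored' flag for alerts nobody acted on."""
    mttn = mean([i["notified_min"] for i in incidents])   # detect -> page
    mttr = mean([i["resolved_min"] for i in incidents])   # detect -> resolve
    ignored = sum(1 for i in incidents if i.get("ignored"))
    return {
        "mttn_min": mttn,
        "mttr_min": mttr,
        "fp_rate": ignored / len(incidents),
    }

incidents = [
    {"notified_min": 3, "resolved_min": 120, "ignored": False},
    {"notified_min": 5, "resolved_min": 240, "ignored": True},
]
print(alert_metrics(incidents))  # {'mttn_min': 4.0, 'mttr_min': 180.0, 'fp_rate': 0.5}
```

In this toy sample the false-positive rate is already at 50%, well above the 20% target, which is the signal to tighten rule filters rather than to add more destinations.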
Checklist
Before going live with on-call alerting:
- PagerDuty integration configured
- Escalation policy set up
- Critical/High/Medium/Low rules defined
- Slack channels created and configured
- On-call runbook documented
- Team trained on response procedures
- Test alert sent and verified
Related Resources
PagerDuty Setup
Detailed PagerDuty integration guide
Alert Best Practices
Reduce alert fatigue
