The Alert Fatigue Problem
Alert fatigue happens when teams receive so many notifications that they begin to tune them out, including the ones that matter.

Core Principles
1. Start Narrow, Expand Carefully
Don’t monitor everything at once:

- Week 1: Monitor 5 critical production tables
- Week 2: Add freshness monitoring to those tables
- Week 3: Expand to 10 more important tables
- Week 4: Review alert history, tune thresholds
- Continue expanding gradually
2. Every Alert Should Be Actionable
Before creating an alert, ask:

- What action should someone take when this fires?
- Is immediate action required, or can it wait?
- Who is the right person to respond?
3. Match Urgency to Destination
| Urgency | Destination | When to Use |
|---|---|---|
| Immediate | PagerDuty | On-call response needed now |
| Soon | Slack | Team should see within hours |
| Eventually | Email | Can be reviewed daily/weekly |
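The urgency table above can be sketched as a simple lookup. This is an illustrative sketch, not a product API; the destination names ("pagerduty", "slack", "email_digest") are placeholders.

```python
# Map each urgency level to its destinations (placeholder names).
URGENCY_ROUTES = {
    "immediate": ["pagerduty"],      # on-call response needed now
    "soon": ["slack"],               # team should see within hours
    "eventually": ["email_digest"],  # reviewed daily or weekly
}

def route(urgency: str) -> list[str]:
    """Return destinations for an alert's urgency; default to Slack."""
    return URGENCY_ROUTES.get(urgency, ["slack"])
```

Defaulting unknown urgencies to Slack (rather than PagerDuty) keeps a misconfigured rule from paging someone unnecessarily.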
Event-Based Routing
Route different event types based on impact severity.

Recommended Setup
| Alert Type | Event | Conditions | Destination |
|---|---|---|---|
| Production breaking changes | Schema Change | Column/table removed | PagerDuty + Slack |
| Production schema changes | Schema Change | All changes | Slack |
| Freshness violations | Freshness Violation | SLA breached | Slack |
| Discovery failures | Discovery Failed | Any failure | Slack + Email |
| Dev/staging changes | Schema Change | Breaking only | Email (daily digest) |
Environment Separation
Monitor different environments differently.

Production
Rules:

- All schema changes → Slack + PagerDuty (for breaking)
- All freshness violations → Slack
- Discovery failures → Slack + Email
Staging
Rules:

- Breaking changes only → Slack
- Freshness (critical tables only) → Slack
Development
Rules:

- None, or weekly digest only
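The per-environment rules above amount to a two-level lookup: environment, then event type. A minimal sketch, assuming hypothetical event and destination names:

```python
# Environment → event → destinations, mirroring the rules above.
# All names are illustrative placeholders, not a product API.
ENV_RULES = {
    "production": {
        "breaking_change": ["slack", "pagerduty"],
        "schema_change": ["slack"],
        "freshness_violation": ["slack"],
        "discovery_failed": ["slack", "email"],
    },
    "staging": {
        "breaking_change": ["slack"],
        "freshness_violation": ["slack"],  # critical tables only
    },
    "development": {},  # none, or route to a weekly digest
}

def destinations(env: str, event: str) -> list[str]:
    """Where (if anywhere) to send an event for a given environment."""
    return ENV_RULES.get(env, {}).get(event, [])
```

An empty list means the event is silently dropped, which is exactly the desired behavior for development databases.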
Threshold Tuning
Start Lenient
If your ETL runs hourly, don’t set a 30-minute SLA:

| Pattern | Starting SLA | After Tuning |
|---|---|---|
| 15 min updates | 45 min | 30 min |
| Hourly updates | 3 hours | 2 hours |
| Daily updates | 36 hours | 24 hours |
Use Warning Thresholds
Two-stage alerts reduce surprise violations. Example: orders table freshness:

- Expected: Updated hourly
- Warning: After 90 minutes (alert to Slack)
- Violation: After 2 hours (alert to PagerDuty)
Review and Tighten
After 2-4 weeks:

- Check alert history
- Identify alerts that fired but weren’t actionable
- Tighten thresholds that never trigger
- Loosen thresholds that trigger too often
Scope Filtering
Include Only What Matters
Filter rules to relevant assets. Example:

Rule: Production Revenue Freshness
- Data source: production-postgres
- Schema: public
- Assets: orders, payments, revenue_*, transaction_*
Exclude Noise
Remove assets that don’t need monitoring:

Exclusions:
- *_temp (temporary tables)
- *_backup (backup copies)
- *_old (deprecated tables)
- pg_temp_* (PostgreSQL temp)
- test_* (test tables)
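The include and exclude patterns above are standard glob patterns, so scope filtering can be sketched with Python's `fnmatch`. Exclusions win over inclusions, so a table like revenue_daily_backup stays out of scope even though it matches revenue_*:

```python
from fnmatch import fnmatch

# Patterns from the example rule and exclusion list above.
INCLUDE = ["orders", "payments", "revenue_*", "transaction_*"]
EXCLUDE = ["*_temp", "*_backup", "*_old", "pg_temp_*", "test_*"]

def in_scope(table: str) -> bool:
    """A table is monitored if it matches an include pattern
    and no exclude pattern; exclusions take precedence."""
    if any(fnmatch(table, pattern) for pattern in EXCLUDE):
        return False
    return any(fnmatch(table, pattern) for pattern in INCLUDE)
```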
Alert Aggregation
Avoid alert storms by grouping related alerts.

Same Asset, Multiple Changes
Instead of:- Column added: new_field_1
- Column added: new_field_2
- Column added: new_field_3
- Column type changed: status
You get a single grouped alert:

- Schema Change: 4 changes detected
- 3 columns added
- 1 column type changed
- View details →
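Grouping boils down to counting changes by type per asset. A minimal sketch of that aggregation step (the summary format is illustrative):

```python
from collections import Counter

def aggregate(asset: str, changes: list[dict]) -> str:
    """Collapse multiple changes on one asset into a single
    summary, instead of firing one alert per change."""
    counts = Counter(change["type"] for change in changes)
    lines = [f"{asset}: Schema Change: {len(changes)} changes detected"]
    lines += [f"- {n} x {kind}" for kind, n in counts.items()]
    return "\n".join(lines)
```

For the four changes in the example above, this produces one alert body listing "3 x column added" and "1 x column type changed" rather than four separate notifications.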
Deduplication
The same change won’t re-alert until it is resolved or a cooldown period passes.

Common Mistakes
Alerting on everything
Problem: Every table, every change, every environment → hundreds of alerts

Solution: Start with 5-10 critical tables. Expand only after you’ve proven the value.
Same destination for everything
Problem: All alerts go to Slack → important ones get buried

Solution: Use event-based routing. PagerDuty for breaking changes, Slack for schema changes, Email for informational.
Too-tight SLAs
Problem: Freshness SLA is 1 hour, but ETL sometimes takes 70 minutes → constant false positives

Solution: Set SLA at 2x expected, tune down over time.
Monitoring dev environments
Problem: Dev databases change constantly → alert storm

Solution: Don’t monitor dev at all, or use weekly email digests only.
No one owns the alerts
Problem: Alerts fire but no one responds

Solution: Define ownership for each alert type. Use PagerDuty with on-call rotations for critical alerts.
Weekly Review Process
Schedule 15-30 minutes weekly to review alerts.

Questions to Ask
1. How many alerts fired this week?
   - If more than 50: Too many. Add filters or raise thresholds.
   - If fewer than 5: Are you monitoring enough?
2. What percentage were actionable?
   - Target: >80%
   - If lower: Identify patterns and add filters
3. Were any issues missed?
   - If yes: Add coverage for those scenarios
4. Which alerts took longest to resolve?
   - These may need better routing or documentation
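The first two review questions can be answered mechanically from an alert history. A sketch, assuming each alert record carries a hypothetical `actionable` flag set during triage:

```python
def weekly_review(alerts: list[dict]) -> dict:
    """Summarize a week of alert history against the review
    questions above. Each alert dict has an 'actionable' bool."""
    fired = len(alerts)
    actionable = sum(a["actionable"] for a in alerts)
    return {
        "fired": fired,
        "actionable_pct": round(100 * actionable / fired, 1) if fired else 0.0,
        "too_many": fired > 50,   # add filters or raise thresholds
        "too_few": fired < 5,     # are you monitoring enough?
        "meets_target": fired > 0 and actionable / fired > 0.8,
    }
```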
Tuning Actions
| Finding | Action |
|---|---|
| Alert fires often but isn’t actioned | Disable or change to email digest |
| Same asset alerts repeatedly | Investigate root cause, not just the alert |
| Critical issue wasn’t alerted | Add coverage |
| Team ignores channel | Reduce volume or change channel |
Sample Alert Configuration
Here’s a recommended starting configuration:

| Rule | Event | Scope | Conditions | Destinations |
|---|---|---|---|---|
| Production Breaking Changes | Schema Change | Production database, all schemas | Column removed OR Table removed | PagerDuty, Slack #incidents |
| Production Schema Changes | Schema Change | Production database, all schemas | All changes | Slack #data-alerts |
| Critical Table Freshness | Freshness Violation | orders, payments, users, products | SLA from asset config | Slack #data-alerts, PagerDuty (if >4h stale) |
| Analytics Freshness | Freshness Violation | daily_*, weekly_*, analytics_* | SLA from asset config | Slack #analytics-team |
| Discovery Failures | Discovery Failed | All | All failures | Slack #data-alerts, Email ops@company.com |
| Staging Changes (Breaking) | Schema Change | Staging database | Column/table removed | Email (daily digest) |
Checklist
Before going live with alerts:

- Defined critical tables (start with 5-10)
- Set up event-based routing (breaking → PagerDuty, others → Slack)
- Excluded dev/test environments
- SLAs set with buffer (2x expected)
- Warning thresholds configured
- Assigned ownership for each alert type
- Scheduled weekly review meeting
- Documented escalation process
Use Schedules and Blackouts
Reduce noise by controlling when alerts fire.

Operating Schedules
Assign operating schedules to rules that only matter during business hours:

- Freshness rules: If your pipelines run overnight, set schedules to only alert during business hours when the team can respond
- Non-critical schema changes: Alert during work hours, suppress overnight
- Development environments: Restrict to CI/CD windows
Blackout Windows
Use blackout windows for planned quiet periods:

- Deployment windows: Suppress alerts during known release times
- Holiday freezes: Create yearly recurring blackouts for company holidays
- Maintenance periods: Silence alerts during planned infrastructure work
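A blackout window is just an interval check before delivery. A minimal sketch with windows given as (start, end) datetime pairs:

```python
from datetime import datetime

def in_blackout(now: datetime,
                windows: list[tuple[datetime, datetime]]) -> bool:
    """True when `now` falls inside any planned quiet period."""
    return any(start <= now < end for start, end in windows)

def should_deliver(now: datetime,
                   windows: list[tuple[datetime, datetime]]) -> bool:
    # Suppress delivery during deployments, holiday freezes,
    # or maintenance; re-enable automatically when the window ends.
    return not in_blackout(now, windows)
```

Recurring blackouts (for example, yearly holiday freezes) would expand to concrete windows before this check runs.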
Related Topics
- Alert Rules: Configure alert rules
- Freshness Monitoring: Set up freshness SLAs
- Slack Integration: Configure Slack alerts
- Alerts Overview: Alert system architecture
- Operating Schedules: Control when rules are active
- Blackout Windows: Suppress alerts during maintenance
