Effective alerting is about balance: too few alerts and you miss issues; too many and you ignore them all. This guide helps you build an alerting strategy that keeps you informed without overwhelming your team.

The Alert Fatigue Problem

Alert fatigue happens when teams receive too many notifications: alerts get ignored, real issues slip through, and trust in the alerting system erodes.

The goal: every alert should be actionable and worth investigating.

Core Principles

1. Start Narrow, Expand Carefully

Don’t monitor everything at once:
  1. Week 1: Monitor 5 critical production tables
  2. Week 2: Add freshness monitoring to those tables
  3. Week 3: Expand to 10 more important tables
  4. Week 4: Review alert history, tune thresholds
  5. Continue expanding gradually

2. Every Alert Should Be Actionable

Before creating an alert, ask:
  • What action should someone take when this fires?
  • Is immediate action required, or can it wait?
  • Who is the right person to respond?
If you can’t answer these questions, the alert may not be useful.

3. Match Urgency to Destination

| Urgency | Destination | When to Use |
| --- | --- | --- |
| Immediate | PagerDuty | On-call response needed now |
| Soon | Slack | Team should see within hours |
| Eventually | Email | Can be reviewed daily/weekly |

Event-Based Routing

Route different event types to the appropriate channel based on impact severity:

| Alert Type | Event | Conditions | Destination |
| --- | --- | --- | --- |
| Production breaking changes | Schema Change | Column/table removed | PagerDuty + Slack |
| Production schema changes | Schema Change | All changes | Slack |
| Freshness violations | Freshness Violation | SLA breached | Slack |
| Discovery failures | Discovery Failed | Any failure | Slack + Email |
| Dev/staging changes | Schema Change | Breaking only | Email |
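The routing table above can be sketched as a small lookup. Everything here — the rule records, event names, and destination strings — is illustrative, not AnomalyArmor's actual API or configuration format:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    event: str          # e.g. "schema_change", "freshness_violation"
    condition: str      # "breaking" or "all"
    environment: str    # e.g. "production", "dev"
    destinations: list  # e.g. ["pagerduty", "slack"]

# Illustrative rules mirroring the routing table above
RULES = [
    Rule("schema_change", "breaking", "production", ["pagerduty", "slack"]),
    Rule("schema_change", "all", "production", ["slack"]),
    Rule("freshness_violation", "all", "production", ["slack"]),
    Rule("discovery_failed", "all", "production", ["slack", "email"]),
    Rule("schema_change", "breaking", "dev", ["email"]),
]

def route(event: str, is_breaking: bool, environment: str) -> set:
    """Collect every destination whose rule matches the event."""
    dests = set()
    for rule in RULES:
        if rule.event != event or rule.environment != environment:
            continue
        if rule.condition == "breaking" and not is_breaking:
            continue
        dests.update(rule.destinations)
    return dests
```

A breaking production schema change matches both production rules, so it fans out to PagerDuty and Slack, while the same change in dev goes only to email.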

Environment Separation

Monitor different environments differently:

Production

Rules:
  • All schema changes → Slack (breaking changes also → PagerDuty)
  • All freshness violations → Slack
  • Discovery failures → Slack + Email
Schedule: Hourly discovery | Threshold: Strict SLAs

Staging

Rules:
  • Breaking changes only → Slack
  • Freshness (critical tables only) → Slack
Schedule: Every 6 hours | Threshold: Lenient SLAs

Development

Rules:
  • None or weekly digest only
Schedule: Daily | Threshold: Very lenient or disabled
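The three environment profiles above could be captured in a single configuration. The key names and structure here are assumptions for this sketch, not a real AnomalyArmor config format:

```python
# Illustrative per-environment monitoring profiles; field names are
# invented for this sketch, not a real AnomalyArmor schema.
ENVIRONMENTS = {
    "production": {
        "discovery_interval_hours": 1,   # hourly discovery
        "sla_strictness": "strict",
        "schema_changes": ["slack"],
        "breaking_changes": ["slack", "pagerduty"],
        "freshness_violations": ["slack"],
        "discovery_failures": ["slack", "email"],
    },
    "staging": {
        "discovery_interval_hours": 6,   # every 6 hours
        "sla_strictness": "lenient",
        "breaking_changes": ["slack"],
        "freshness_violations": ["slack"],  # critical tables only
    },
    "development": {
        "discovery_interval_hours": 24,  # daily
        "sla_strictness": "disabled",
        "weekly_digest": ["email"],      # or no alerts at all
    },
}
```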

Threshold Tuning

Start Lenient

If your ETL runs hourly, don’t set a 30-minute SLA:
| Pattern | Starting SLA | After Tuning |
| --- | --- | --- |
| 15 min updates | 45 min | 30 min |
| Hourly updates | 3 hours | 2 hours |
| Daily updates | 36 hours | 24 hours |
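The starting SLAs in the table are roughly 1.5-3x the expected update interval. A helper like the following (a sketch, not part of any product API) makes the multiplier explicit so it can be tuned down later:

```python
def starting_sla_minutes(update_interval_minutes: int,
                         multiplier: float = 3.0) -> int:
    """Pick a lenient starting SLA as a multiple of the expected update
    interval. The 3x default mirrors the first two table rows; daily
    pipelines in the table use a gentler 1.5x. Tighten after observing
    real pipeline behavior."""
    return int(update_interval_minutes * multiplier)
```

For example, `starting_sla_minutes(15)` gives 45 minutes, and `starting_sla_minutes(1440, multiplier=1.5)` gives the 36-hour starting SLA for daily updates.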

Use Warning Thresholds

Two-stage alerts reduce surprise violations. For example, orders table freshness:
  • Expected: Updated hourly
  • Warning: After 90 minutes (alert to Slack)
  • Violation: After 2 hours (alert to PagerDuty)
Warnings give you time to investigate before escalation.
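The two-stage check above amounts to classifying staleness against two thresholds. A minimal sketch (thresholds and return values are illustrative):

```python
from datetime import datetime, timedelta, timezone

def freshness_level(last_updated, now,
                    warning=timedelta(minutes=90),
                    violation=timedelta(hours=2)):
    """Classify staleness into ok / warning / violation, mirroring the
    hourly-orders example: warn at 90 minutes, violate at 2 hours."""
    age = now - last_updated
    if age >= violation:
        return "violation"   # e.g. escalate to PagerDuty
    if age >= warning:
        return "warning"     # e.g. post to Slack
    return "ok"
```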

Review and Tighten

After 2-4 weeks:
  1. Check alert history
  2. Identify alerts that fired but weren’t actionable
  3. Tighten thresholds that never trigger
  4. Loosen thresholds that trigger too often

Scope Filtering

Include Only What Matters

Filter rules to relevant assets. For example:

Rule: Production Revenue Freshness
  • Data source: production-postgres
  • Schema: public
  • Assets: orders, payments, revenue_*, transaction_*

Exclude Noise

Remove assets that don’t need monitoring. Example exclusions:
  • *_temp (temporary tables)
  • *_backup (backup copies)
  • *_old (deprecated tables)
  • pg_temp_* (PostgreSQL temp)
  • test_* (test tables)
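Glob patterns like these map directly onto Python's standard `fnmatch` matching. This sketch shows how such an exclusion list could be applied; it is not AnomalyArmor's internal implementation:

```python
from fnmatch import fnmatch

# Exclusion globs from the list above
EXCLUDE_PATTERNS = ["*_temp", "*_backup", "*_old", "pg_temp_*", "test_*"]

def monitored(assets, exclude=EXCLUDE_PATTERNS):
    """Keep only assets that match none of the exclusion globs."""
    return [a for a in assets
            if not any(fnmatch(a, pat) for pat in exclude)]
```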

Alert Aggregation

Avoid alert storms by grouping related alerts:

Same Asset, Multiple Changes

Instead of:
  • Column added: new_field_1
  • Column added: new_field_2
  • Column added: new_field_3
  • Column type changed: status
AnomalyArmor groups these into a single alert:
  • Schema Change: 4 changes detected
    • 3 columns added
    • 1 column type changed
    • View details →
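The grouping above boils down to counting changes by kind and emitting one summary line. A sketch (the input shape and summary format are illustrative):

```python
from collections import Counter

def summarize_changes(changes):
    """Collapse a burst of per-column events into one grouped alert,
    like the '4 changes detected' example above.
    `changes` is a list of (column_name, change_kind) pairs."""
    counts = Counter(kind for _, kind in changes)
    parts = [f"{n} {kind}" for kind, n in counts.items()]
    return (f"Schema Change: {len(changes)} changes detected "
            f"({', '.join(parts)})")
```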

Deduplication

The same change won’t re-alert until resolved or a cooldown period passes.
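Cooldown-based deduplication can be sketched as a map from alert fingerprint to last-sent time. The fingerprint format and 6-hour cooldown here are illustrative assumptions, not product defaults:

```python
from datetime import datetime, timedelta, timezone

class Deduplicator:
    """Suppress repeat alerts for the same fingerprint until the
    cooldown elapses (illustrative; real systems also reset on
    resolution, which is omitted here)."""

    def __init__(self, cooldown=timedelta(hours=6)):
        self.cooldown = cooldown
        self.last_sent = {}  # fingerprint -> datetime

    def should_send(self, fingerprint, now):
        prev = self.last_sent.get(fingerprint)
        if prev is not None and now - prev < self.cooldown:
            return False  # still in cooldown: drop the duplicate
        self.last_sent[fingerprint] = now
        return True
```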

Common Mistakes

Problem: Every table, every change, every environment → hundreds of alerts.
Solution: Start with 5-10 critical tables. Expand only after you’ve proven the value.

Problem: All alerts go to Slack → important ones get buried.
Solution: Use event-based routing: PagerDuty for breaking changes, Slack for schema changes, email for informational alerts.

Problem: Freshness SLA is 1 hour, but ETL sometimes takes 70 minutes → constant false positives.
Solution: Set the SLA at 2x the expected interval and tune it down over time.

Problem: Dev databases change constantly → alert storm.
Solution: Don’t monitor dev at all, or use weekly email digests only.

Problem: Alerts fire but no one responds.
Solution: Define ownership for each alert type. Use PagerDuty with on-call rotations for critical alerts.

Weekly Review Process

Schedule 15-30 minutes weekly to review alerts:

Questions to Ask

  1. How many alerts fired this week?
    • If more than 50: Too many. Add filters or raise thresholds.
    • If fewer than 5: Are you monitoring enough?
  2. What percentage were actionable?
    • Target: >80%
    • If lower: Identify patterns and add filters
  3. Were any issues missed?
    • If yes: Add coverage for those scenarios
  4. Which alerts took longest to resolve?
    • These may need better routing or documentation
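The review questions above reduce to a few numbers per week. This sketch computes them from a list of alert records; the record shape and finding strings are assumptions for illustration:

```python
def review_stats(alerts):
    """Compute weekly-review metrics from alert records, where each
    record is a dict with an 'actionable' boolean."""
    total = len(alerts)
    actionable = sum(1 for a in alerts if a["actionable"])
    pct = 100 * actionable / total if total else 0.0
    findings = []
    if total > 50:
        findings.append("too many alerts: add filters or raise thresholds")
    elif total < 5:
        findings.append("very few alerts: check whether coverage is sufficient")
    if total and pct < 80:  # target: >80% actionable
        findings.append("low actionability: identify patterns and add filters")
    return {"total": total, "actionable_pct": pct, "findings": findings}
```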

Tuning Actions

| Finding | Action |
| --- | --- |
| Alert fires often but isn’t actioned | Disable or change to email digest |
| Same asset alerts repeatedly | Investigate root cause, not just the alert |
| Critical issue wasn’t alerted | Add coverage |
| Team ignores channel | Reduce volume or change channel |

Sample Alert Configuration

Here’s a recommended starting configuration:
| Rule | Event | Scope | Conditions | Destinations |
| --- | --- | --- | --- | --- |
| Production Breaking Changes | Schema Change | Production database, all schemas | Column removed OR table removed | PagerDuty, Slack #incidents |
| Production Schema Changes | Schema Change | Production database, all schemas | All changes | Slack #data-alerts |
| Critical Table Freshness | Freshness Violation | orders, payments, users, products | SLA from asset config | Slack #data-alerts, PagerDuty (if >4h stale) |
| Analytics Freshness | Freshness Violation | daily_*, weekly_*, analytics_* | SLA from asset config | Slack #analytics-team |
| Discovery Failures | Discovery Failed | All | All failures | Slack #data-alerts, Email ops@company.com |
| Staging Changes (Breaking) | Schema Change | Staging database | Column/table removed | Email (daily digest) |

Checklist

Before going live with alerts:
  • Defined critical tables (start with 5-10)
  • Set up event-based routing (breaking → PagerDuty, others → Slack)
  • Excluded dev/test environments
  • SLAs set with buffer (2x expected)
  • Warning thresholds configured
  • Assigned ownership for each alert type
  • Scheduled weekly review meeting
  • Documented escalation process

Use Schedules and Blackouts

Reduce noise by controlling when alerts fire:

Operating Schedules

Assign operating schedules to rules that only matter during business hours:
  • Freshness rules: If your pipelines run overnight, set schedules to only alert during business hours when the team can respond
  • Non-critical schema changes: Alert during work hours, suppress overnight
  • Development environments: Restrict to CI/CD windows

Blackout Windows

Use blackout windows for planned quiet periods:
  • Deployment windows: Suppress alerts during known release times
  • Holiday freezes: Create yearly recurring blackouts for company holidays
  • Maintenance periods: Silence alerts during planned infrastructure work
Combine schedules and blackouts: schedules handle recurring weekly patterns, blackouts handle specific date ranges. Both keep your team focused on alerts they can act on.
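Combining a recurring operating schedule with explicit blackout ranges can be sketched as a single gate that every alert passes through before sending. The business-hours window, weekday check, and blackout dates below are illustrative assumptions:

```python
from datetime import datetime, time, timezone

# Illustrative operating schedule: weekdays, 09:00-18:00
BUSINESS_HOURS = (time(9, 0), time(18, 0))

# Illustrative blackout windows: explicit (start, end) date ranges,
# e.g. a holiday deployment freeze
BLACKOUTS = [
    (datetime(2024, 12, 24, tzinfo=timezone.utc),
     datetime(2024, 12, 27, tzinfo=timezone.utc)),
]

def alert_allowed(now):
    """Return True only when outside every blackout window AND inside
    the recurring weekly operating schedule."""
    for start, end in BLACKOUTS:
        if start <= now < end:
            return False
    if now.weekday() >= 5:  # Saturday or Sunday
        return False
    open_t, close_t = BUSINESS_HOURS
    return open_t <= now.time() < close_t
```

Blackouts are checked first because they override the schedule: a date-range freeze should silence alerts even during normal operating hours.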

Related Documentation

  • Alert Rules: Configure alert rules
  • Freshness Monitoring: Set up freshness SLAs
  • Slack Integration: Configure Slack alerts
  • Alerts Overview: Alert system architecture
  • Operating Schedules: Control when rules are active
  • Blackout Windows: Suppress alerts during maintenance