The Alert Fatigue Problem
Alert fatigue happens when teams receive so many notifications that they begin to tune them out, including the ones that matter.

Core Principles
1. Start Narrow, Expand Carefully
Don’t monitor everything at once:

- Week 1: Monitor 5 critical production tables
- Week 2: Add freshness monitoring to those tables
- Week 3: Expand to 10 more important tables
- Week 4: Review alert history, tune thresholds
- Continue expanding gradually
2. Every Alert Should Be Actionable
Before creating an alert, ask:

- What action should someone take when this fires?
- Is immediate action required, or can it wait?
- Who is the right person to respond?
3. Match Urgency to Destination
| Urgency | Destination | When to Use |
|---|---|---|
| Immediate | PagerDuty | On-call response needed now |
| Soon | Slack | Team should see within hours |
| Eventually | Email | Can be reviewed daily/weekly |
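The urgency table above can be sketched as a simple lookup. This is an illustrative sketch, not a product API; the destination names ("pagerduty", "slack", "email_digest") are placeholders.

```python
# Map each urgency level to its destinations (placeholder names).
URGENCY_ROUTES = {
    "immediate": ["pagerduty"],      # on-call response needed now
    "soon": ["slack"],               # team should see within hours
    "eventually": ["email_digest"],  # reviewed daily or weekly
}

def route(urgency: str) -> list[str]:
    """Return destinations for an alert's urgency; default to Slack."""
    return URGENCY_ROUTES.get(urgency, ["slack"])
```

Defaulting unknown urgencies to Slack (rather than PagerDuty) keeps a misconfigured rule from paging someone unnecessarily.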
Event-Based Routing
Route different event types based on impact severity.

Recommended Setup
| Alert Type | Event | Conditions | Destination |
|---|---|---|---|
| Production breaking changes | Schema Change | Column/table removed | PagerDuty + Slack |
| Production schema changes | Schema Change | All changes | Slack |
| Freshness violations | Freshness Violation | SLA breached | Slack |
| Discovery failures | Discovery Failed | Any failure | Slack + Email |
| Dev/staging changes | Schema Change | Breaking only | Email (daily digest) |
Environment Separation
Monitor different environments differently.

Production
Rules:

- All schema changes → Slack + PagerDuty (for breaking)
- All freshness violations → Slack
- Discovery failures → Slack + Email
Staging
Rules:

- Breaking changes only → Slack
- Freshness (critical tables only) → Slack
Development
Rules:

- None, or weekly digest only
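The per-environment rules above amount to a two-level lookup: environment, then event type. A minimal sketch, assuming hypothetical event and destination names:

```python
# Environment → event → destinations, mirroring the rules above.
# All names are illustrative placeholders, not a product API.
ENV_RULES = {
    "production": {
        "breaking_change": ["slack", "pagerduty"],
        "schema_change": ["slack"],
        "freshness_violation": ["slack"],
        "discovery_failed": ["slack", "email"],
    },
    "staging": {
        "breaking_change": ["slack"],
        "freshness_violation": ["slack"],  # critical tables only
    },
    "development": {},  # none, or route to a weekly digest
}

def destinations(env: str, event: str) -> list[str]:
    """Where (if anywhere) to send an event for a given environment."""
    return ENV_RULES.get(env, {}).get(event, [])
```

An empty list means the event is silently dropped, which is exactly the desired behavior for development databases.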
Threshold Tuning
Start Lenient
If your ETL runs hourly, don’t set a 30-minute SLA:

| Pattern | Starting SLA | After Tuning |
|---|---|---|
| 15 min updates | 45 min | 30 min |
| Hourly updates | 3 hours | 2 hours |
| Daily updates | 36 hours | 24 hours |
Use Warning Thresholds
Two-stage alerts reduce surprise violations. Example: orders table freshness:

- Expected: Updated hourly
- Warning: After 90 minutes (alert to Slack)
- Violation: After 2 hours (alert to PagerDuty)
Review and Tighten
After 2-4 weeks:

- Check alert history
- Identify alerts that fired but weren’t actionable
- Tighten thresholds that never trigger
- Loosen thresholds that trigger too often
Scope Filtering
Include Only What Matters
Filter rules to relevant assets. Example:

Rule: Production Revenue Freshness
- Data source: production-postgres
- Schema: public
- Assets: orders, payments, revenue_*, transaction_*
Exclude Noise
Remove assets that don’t need monitoring:

Exclusions:
- *_temp (temporary tables)
- *_backup (backup copies)
- *_old (deprecated tables)
- pg_temp_* (PostgreSQL temp)
- test_* (test tables)
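The include and exclude patterns above are standard glob patterns, so scope filtering can be sketched with Python's `fnmatch`. Exclusions win over inclusions, so a table like revenue_daily_backup stays out of scope even though it matches revenue_*:

```python
from fnmatch import fnmatch

# Patterns from the example rule and exclusion list above.
INCLUDE = ["orders", "payments", "revenue_*", "transaction_*"]
EXCLUDE = ["*_temp", "*_backup", "*_old", "pg_temp_*", "test_*"]

def in_scope(table: str) -> bool:
    """A table is monitored if it matches an include pattern
    and no exclude pattern; exclusions take precedence."""
    if any(fnmatch(table, pattern) for pattern in EXCLUDE):
        return False
    return any(fnmatch(table, pattern) for pattern in INCLUDE)
```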
Alert Aggregation
Avoid alert storms by grouping related alerts.

Same Asset, Multiple Changes
Instead of:- Column added: new_field_1
- Column added: new_field_2
- Column added: new_field_3
- Column type changed: status
You get a single grouped alert:

- Schema Change: 4 changes detected
- 3 columns added
- 1 column type changed
- View details →
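Grouping boils down to counting changes by type per asset. A minimal sketch of that aggregation step (the summary format is illustrative):

```python
from collections import Counter

def aggregate(asset: str, changes: list[dict]) -> str:
    """Collapse multiple changes on one asset into a single
    summary, instead of firing one alert per change."""
    counts = Counter(change["type"] for change in changes)
    lines = [f"{asset}: Schema Change: {len(changes)} changes detected"]
    lines += [f"- {n} x {kind}" for kind, n in counts.items()]
    return "\n".join(lines)
```

For the four changes in the example above, this produces one alert body listing "3 x column added" and "1 x column type changed" rather than four separate notifications.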
Deduplication
The same change won’t re-alert until it is resolved or a cooldown period passes.

Common Mistakes
Alerting on everything
Problem: Every table, every change, every environment → hundreds of alerts

Solution: Start with 5-10 critical tables. Expand only after you’ve proven the value.
Same destination for everything
Problem: All alerts go to Slack → important ones get buried

Solution: Use event-based routing. PagerDuty for breaking changes, Slack for schema changes, Email for informational.
Too-tight SLAs
Problem: Freshness SLA is 1 hour, but ETL sometimes takes 70 minutes → constant false positives

Solution: Set SLA at 2x expected, tune down over time.
Monitoring dev environments
Problem: Dev databases change constantly → alert storm

Solution: Don’t monitor dev at all, or use weekly email digests only.
No one owns the alerts
Problem: Alerts fire but no one responds

Solution: Define ownership for each alert type. Use PagerDuty with on-call rotations for critical alerts.
Weekly Review Process
Schedule 15-30 minutes weekly to review alerts.

Questions to Ask
1. How many alerts fired this week?
   - If more than 50: Too many. Add filters or raise thresholds.
   - If fewer than 5: Are you monitoring enough?
2. What percentage were actionable?
   - Target: >80%
   - If lower: Identify patterns and add filters
3. Were any issues missed?
   - If yes: Add coverage for those scenarios
4. Which alerts took longest to resolve?
   - These may need better routing or documentation
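The first two review questions can be answered mechanically from an alert history. A sketch, assuming each alert record carries a hypothetical `actionable` flag set during triage:

```python
def weekly_review(alerts: list[dict]) -> dict:
    """Summarize a week of alert history against the review
    questions above. Each alert dict has an 'actionable' bool."""
    fired = len(alerts)
    actionable = sum(a["actionable"] for a in alerts)
    return {
        "fired": fired,
        "actionable_pct": round(100 * actionable / fired, 1) if fired else 0.0,
        "too_many": fired > 50,   # add filters or raise thresholds
        "too_few": fired < 5,     # are you monitoring enough?
        "meets_target": fired > 0 and actionable / fired > 0.8,
    }
```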
Tuning Actions
| Finding | Action |
|---|---|
| Alert fires often but isn’t actioned | Disable or change to email digest |
| Same asset alerts repeatedly | Investigate root cause, not just the alert |
| Critical issue wasn’t alerted | Add coverage |
| Team ignores channel | Reduce volume or change channel |
Sample Alert Configuration
Here’s a recommended starting configuration:

| Rule | Event | Scope | Conditions | Destinations |
|---|---|---|---|---|
| Production Breaking Changes | Schema Change | Production database, all schemas | Column removed OR Table removed | PagerDuty, Slack #incidents |
| Production Schema Changes | Schema Change | Production database, all schemas | All changes | Slack #data-alerts |
| Critical Table Freshness | Freshness Violation | orders, payments, users, products | SLA from asset config | Slack #data-alerts, PagerDuty (if >4h stale) |
| Analytics Freshness | Freshness Violation | daily_*, weekly_*, analytics_* | SLA from asset config | Slack #analytics-team |
| Discovery Failures | Discovery Failed | All | All failures | Slack #data-alerts, Email ops@company.com |
| Staging Changes (Breaking) | Schema Change | Staging database | Column/table removed | Email (daily digest) |
Checklist
Before going live with alerts:

- Defined critical tables (start with 5-10)
- Set up event-based routing (breaking → PagerDuty, others → Slack)
- Excluded dev/test environments
- SLAs set with buffer (2x expected)
- Warning thresholds configured
- Assigned ownership for each alert type
- Scheduled weekly review meeting
- Documented escalation process
Use Schedules and Blackouts
Reduce noise by controlling when alerts fire.

Operating Schedules
Assign operating schedules to rules that only matter during business hours:

- Freshness rules: If your pipelines run overnight, set schedules to only alert during business hours when the team can respond
- Non-critical schema changes: Alert during work hours, suppress overnight
- Development environments: Restrict to CI/CD windows
Blackout Windows
Use blackout windows for planned quiet periods:

- Deployment windows: Suppress alerts during known release times
- Holiday freezes: Create yearly recurring blackouts for company holidays
- Maintenance periods: Silence alerts during planned infrastructure work
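A blackout window is just an interval check before delivery. A minimal sketch with windows given as (start, end) datetime pairs:

```python
from datetime import datetime

def in_blackout(now: datetime,
                windows: list[tuple[datetime, datetime]]) -> bool:
    """True when `now` falls inside any planned quiet period."""
    return any(start <= now < end for start, end in windows)

def should_deliver(now: datetime,
                   windows: list[tuple[datetime, datetime]]) -> bool:
    # Suppress delivery during deployments, holiday freezes,
    # or maintenance; re-enable automatically when the window ends.
    return not in_blackout(now, windows)
```

Recurring blackouts (for example, yearly holiday freezes) would expand to concrete windows before this check runs.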
Related Topics
- Alert Rules: Configure alert rules
- Freshness Monitoring: Set up freshness SLAs
- Slack Integration: Configure Slack alerts
- Alerts Overview: Alert system architecture
- Operating Schedules: Control when rules are active
- Blackout Windows: Suppress alerts during maintenance
