Email analytics data pipeline: from webhook to dashboard
Your ESP fires webhooks for every event: deliveries, opens, clicks, bounces, unsubscribes. Thousands per minute. Most teams dump them into PostgreSQL and then complain about a slow dashboard. The dashboard is not the problem. The architecture is. This article walks through a concrete pipeline design that handles millions of events and returns charts in milliseconds.
Overall architecture: four components
The pipeline has four stages. Each one does a single job, so you can swap any stage without breaking the others.
1. Webhook receiver. Accepts HTTP POST requests from your ESP (SendGrid, Mailgun, Postmark, Amazon SES). Validates the signature, normalizes the event format, and publishes to a queue.
2. Message queue. Kafka, RabbitMQ, or Redis Streams. A buffer between ingestion and processing. If ClickHouse goes down for a moment, events are not lost.
3. ETL layer. Reads events from the queue, enriches them (attaches campaign_id, resolves geo from IP, computes time-to-open), and batch-inserts into storage.
4. Storage and visualization. ClickHouse for analytical queries, Grafana for dashboards. Alternatives exist: BigQuery + Looker Studio, or TimescaleDB + Metabase. But for the combination of speed, cost, and flexibility, ClickHouse + Grafana wins at most data volumes.
ESP (SendGrid / Mailgun / SES)
│
▼ HTTP POST (webhook)
┌──────────────────┐
│ Webhook Receiver │ ← signature validation, normalization
└──────┬───────────┘
│
▼ publish
┌──────────────────┐
│ Kafka / Redis │ ← buffer, delivery guarantee
└──────┬───────────┘
│
▼ consume + transform
┌──────────────────┐
│ ETL Worker │ ← enrichment, batch insert
└──────┬───────────┘
│
▼ INSERT batch
┌──────────────────┐
│ ClickHouse │ ← columnar storage
└──────┬───────────┘
│
▼ SQL
┌──────────────────┐
│ Grafana │ ← dashboards, alerts
└──────────────────┘Let's go through each component with code.
Webhook receiver: ingestion and validation
Every ESP sends webhooks in its own format. SendGrid batches events into a single POST. Mailgun sends form-encoded data with an HMAC signature. Amazon SES goes through SNS and requires a subscription confirmation step. The receiver's first job is to reduce all of that to a single schema.
Minimum structure for a normalized event:
interface EmailEvent {
event_id: string; // idempotency key
event_type: string; // delivered | opened | clicked | bounced | complained | unsubscribed
email: string; // recipient address
campaign_id: string; // campaign identifier
timestamp: number; // unix timestamp
esp: string; // sendgrid | mailgun | ses
ip?: string; // recipient IP (for geo)
user_agent?: string; // for client detection
url?: string; // for click events
bounce_type?: string; // hard | soft
raw: object; // original payload for debugging
}The receiver runs on Node.js with Fastify. Why Fastify over Express: under webhook load, Fastify handles 2–3x more requests per second per core in benchmarks. For an endpoint receiving thousands of requests per minute, that gap matters.
import Fastify from 'fastify';
import crypto from 'crypto';
import { Kafka } from 'kafkajs';
const app = Fastify({ logger: true });
const kafka = new Kafka({ brokers: ['kafka:9092'] });
const producer = kafka.producer();
await producer.connect();
// SendGrid signature verification
function verifySendGridSignature(publicKey, payload, signature, timestamp) {
const data = timestamp + JSON.stringify(payload);
const verifier = crypto.createVerify('SHA256');
verifier.update(data);
return verifier.verify(publicKey, signature, 'base64');
}
// SendGrid event normalization
function normalizeSendGrid(events) {
return events.map(e => ({
event_id: e.sg_event_id,
event_type: mapEventType(e.event), // bounce → bounced, open → opened
email: e.email,
campaign_id: e.marketing_campaign_id || e.asm_group_id || 'unknown',
timestamp: e.timestamp,
esp: 'sendgrid',
ip: e.ip,
user_agent: e.useragent,
url: e.url,
bounce_type: e.type === 'bounce' ? (e.status?.startsWith('5') ? 'hard' : 'soft') : undefined,
raw: e,
}));
}
app.post('/webhooks/sendgrid', async (req, reply) => {
// 1. Verify signature
const signature = req.headers['x-twilio-email-event-webhook-signature'];
const timestamp = req.headers['x-twilio-email-event-webhook-timestamp'];
if (!verifySendGridSignature(process.env.SG_WEBHOOK_KEY, req.body, signature, timestamp)) {
return reply.code(401).send({ error: 'invalid signature' });
}
// 2. Normalize
const events = normalizeSendGrid(req.body);
// 3. Publish to Kafka
await producer.send({
topic: 'email-events',
messages: events.map(e => ({
key: e.email, // partition by email address
value: JSON.stringify(e),
})),
});
return reply.code(200).send({ accepted: events.length });
});
app.listen({ port: 3100, host: '0.0.0.0' });Three things worth spelling out.
Signature verification is not optional. Without it, anyone can POST to your endpoint and pollute your data. Or worse, trigger business logic with fabricated events. Every ESP provides a signing mechanism: SendGrid uses ECDSA, Mailgun uses HMAC-SHA256, SES uses SNS message signing. Do not skip this step.
Idempotency via event_id. ESPs retry webhook delivery on timeout. Without deduplication, one open becomes three. The event_id field lets you filter duplicates in the ETL stage.
Partition by email address. All events for one recipient land in the same Kafka partition. That preserves ordering: delivered always before opened, opened before clicked. Without partitioning, events arrive out of sequence and your funnel breaks.
ETL: enrichment and batch writes
The consumer reads events from Kafka, adds metadata, and inserts into ClickHouse in batches. Batching is not optional: ClickHouse is designed for large INSERTs. Single-row inserts create one part per insert, the background merge cannot keep up, and the server crashes with "too many parts." Minimum batch: 1,000 rows or every 5 seconds, whichever comes first.
import { Kafka } from 'kafkajs';
import { createClient } from '@clickhouse/client';
import geoip from 'geoip-lite';
const kafka = new Kafka({ brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'etl-email-events' });
const ch = createClient({ host: 'http://clickhouse:8123' });
const BATCH_SIZE = 1000;
const FLUSH_INTERVAL_MS = 5000;
let buffer = [];
let flushTimer = null;
async function flush() {
if (buffer.length === 0) return;
const batch = buffer.splice(0, buffer.length);
await ch.insert({
table: 'email_events',
values: batch,
format: 'JSONEachRow',
});
console.log(`Inserted ${batch.length} events into ClickHouse`);
}
function scheduleFlush() {
if (flushTimer) clearTimeout(flushTimer);
flushTimer = setTimeout(flush, FLUSH_INTERVAL_MS);
}
function enrichEvent(event) {
// Geo enrichment from IP
const geo = event.ip ? geoip.lookup(event.ip) : null;
return {
...event,
country: geo?.country || 'unknown',
city: geo?.city || 'unknown',
// Seconds from send to open
time_to_open: event.event_type === 'opened' && event.sent_at
? event.timestamp - event.sent_at
: null,
// Day of week and hour for send-time optimization analysis
day_of_week: new Date(event.timestamp * 1000).getUTCDay(),
hour_of_day: new Date(event.timestamp * 1000).getUTCHours(),
// Date for table partitioning
event_date: new Date(event.timestamp * 1000).toISOString().split('T')[0],
};
}
await consumer.connect();
await consumer.subscribe({ topic: 'email-events', fromBeginning: false });
await consumer.run({
eachMessage: async ({ message }) => {
const event = JSON.parse(message.value.toString());
const enriched = enrichEvent(event);
buffer.push(enriched);
if (buffer.length >= BATCH_SIZE) {
await flush();
} else {
scheduleFlush();
}
},
});Enrichment adds three categories of data. Geo from IP, for regional reports. Time metrics (time_to_open, day_of_week, hour_of_day), for engagement analysis and send-time optimization. Event date (event_date), for ClickHouse table partitioning.
Deduplication by event_id happens at the ClickHouse level. ReplacingMergeTree with event_id in the sort key removes duplicates during background merges. That is more reliable than filtering in the application: if the ETL worker restarts and reprocesses some messages, ClickHouse handles it automatically.
ClickHouse: table schema
ClickHouse is a columnar database built for analytical queries. Aggregating millions of rows across a few columns is its normal workload. A query for open rate by campaign over the last month runs in tens of milliseconds, not seconds.
CREATE TABLE email_events (
event_id String,
event_type LowCardinality(String), -- delivered, opened, clicked, bounced, ...
email String,
campaign_id String,
timestamp UInt32,
esp LowCardinality(String),
ip Nullable(String),
user_agent Nullable(String),
url Nullable(String),
bounce_type Nullable(String),
country LowCardinality(String),
city String,
time_to_open Nullable(UInt32),
day_of_week UInt8,
hour_of_day UInt8,
event_date Date
)
ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (campaign_id, event_type, email, event_id)
TTL event_date + INTERVAL 12 MONTH
SETTINGS index_granularity = 8192;A few decisions baked into this schema.
LowCardinality for columns with few distinct values (event_type, esp, country). ClickHouse stores them as dictionary-encoded columns, which compresses data 10–20x and speeds up filtering.
ORDER BY (campaign_id, event_type, email, event_id). The key order matches the most common queries: all events for campaign X, all opens for campaign X, all events for user Y in campaign X. ClickHouse stores data sorted by these columns, so filtering on them runs in O(log n) with no extra indexes.
ReplacingMergeTree with event_id in the sort key. Duplicates are removed during background merges. For analytics that tolerate a small degree of approximation, this is enough. If you need exact deduplication at read time, add FINAL to your SELECT or wrap it in a subquery with argMax.
TTL 12 months. Data older than a year is deleted automatically. Email analytics from last year rarely drives operational decisions. If you need longer retention, raise the limit or configure tiering to S3.
Materialized views: pre-aggregation
A dashboard that scans hundreds of millions of rows on every load will eventually become slow. ClickHouse materialized views solve this by aggregating data at insert time and storing the result in a separate table. Dashboard queries read thousands of rows instead of millions.
-- Aggregate: daily campaign metrics
CREATE MATERIALIZED VIEW campaign_daily_mv
TO campaign_daily_stats
AS SELECT
event_date,
campaign_id,
countIf(event_type = 'delivered') AS delivered,
countIf(event_type = 'opened') AS opened,
countIf(event_type = 'clicked') AS clicked,
countIf(event_type = 'bounced') AS bounced,
countIf(event_type = 'complained') AS complained,
countIf(event_type = 'unsubscribed') AS unsubscribed,
-- Rates
round(countIf(event_type = 'opened') / countIf(event_type = 'delivered') * 100, 2)
AS open_rate,
round(countIf(event_type = 'clicked') / countIf(event_type = 'opened') * 100, 2)
AS click_rate,
round(countIf(event_type = 'bounced') / (countIf(event_type = 'delivered') + countIf(event_type = 'bounced')) * 100, 2)
AS bounce_rate
FROM email_events
GROUP BY event_date, campaign_id;
-- Target table for the aggregate
CREATE TABLE campaign_daily_stats (
event_date Date,
campaign_id String,
delivered UInt64,
opened UInt64,
clicked UInt64,
bounced UInt64,
complained UInt64,
unsubscribed UInt64,
open_rate Float32,
click_rate Float32,
bounce_rate Float32
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, campaign_id);SummingMergeTree sums numeric columns when merging rows with the same key. Each batch insert into email_events automatically updates campaign_daily_stats. Grafana reads that table instead of the main one and responds instantly.
For more complex aggregates (unique recipients, median time_to_open), use AggregatingMergeTree with uniqState and quantileState functions. More setup, but you get exact unique counts without a full table scan.
Grafana: a dashboard in 30 minutes
Grafana connects to ClickHouse via the official plugin (grafana-clickhouse-datasource). After that, each panel just runs a SQL query.
The minimum useful panels for an email dashboard:
Open rate over time — a line chart over the last 30 days. A drop of 5+ percentage points signals a deliverability problem.
Bounce rate by campaign — a sortable table. Campaigns above 3% bounce rate need attention: the segment likely contains stale addresses.
Open heatmap — hour_of_day vs day_of_week. Shows when subscribers actually open. Use this for send-time optimization.
Campaign funnel — delivered / opened / clicked. Quick diagnosis: high delivered but low opened means the subject line is the problem. Normal opens but low clicks means the content or CTA.
-- Query for "Open rate over time" panel
SELECT
event_date AS time,
round(open_rate, 1) AS "Open Rate %"
FROM campaign_daily_stats
WHERE
campaign_id = '$campaign' -- Grafana variable
AND event_date >= today() - 30
ORDER BY event_date;
-- Query for the open heatmap
SELECT
day_of_week,
hour_of_day,
count() AS opens
FROM email_events
WHERE
event_type = 'opened'
AND event_date >= today() - 90
GROUP BY day_of_week, hour_of_day
ORDER BY day_of_week, hour_of_day;Grafana supports alerts. Set one for bounce rate above 5% over the last 24 hours — that is the threshold where ESPs start throttling sends. Early warning gives you time to pause the campaign and investigate.
Where email validation fits in
The pipeline shows what happened. Validation determines what will happen. If invalid addresses get into a campaign, the pipeline will honestly report 15% bounce rate and a falling open rate. By then the damage is done: domain reputation is down, your ESP has deprioritized your traffic, and you spent money sending to nowhere.
Validation runs before the pipeline — at the segment-building stage. Before launching a campaign, run the list through a check. Addresses with high risk (non-existent, disposable, spam traps) are removed upfront. The pipeline gets clean events, metrics reflect reality, not noise from dead addresses.
Automation is straightforward: add a validation step to your mailing CI/CD, or to a cron job that checks the database once a week. The API takes a list and returns a status for each address. Addresses with status "undeliverable" are excluded from the next campaign.
# Example: validate a list before sending
import requests
API_KEY = "your_api_key"
API_URL = "https://api.uchecker.net/api/v1/validate/single"
def validate_email(email: str) -> dict:
resp = requests.post(
API_URL,
headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
json={"email": email},
)
return resp.json()
def clean_segment(emails: list[str]) -> list[str]:
clean = []
for email in emails:
result = validate_email(email)
if result.get("result") == "deliverable":
clean.append(email)
else:
print(f"Excluded: {email} — {result.get('result')}")
return clean
# For large lists use the bulk API instead of single
# POST /api/v1/validate/bulkThe result: bounce rate stays below 2%, the dashboard shows clean metrics, and your ESP does not penalize the domain.
Deployment: what it costs and where to run it
For up to 500,000 subscribers and 3–5 campaigns per week — roughly 10–20 million events per month — a modest setup is enough:
Webhook receiver + ETL worker: one VPS with 2 vCPU, 4 GB RAM. Two Node.js processes. $15–20/month.
Kafka: managed Kafka (Upstash, Confluent Cloud) on the free or entry tier. Or Redis Streams on the same server if you want to keep things simple.
ClickHouse: ClickHouse Cloud (from $50/month), Altinity Cloud, or self-hosted on a VPS with 4 vCPU and 16 GB RAM. For 20 million rows per month with a 12-month TTL (around 240 million rows at peak), 16 GB is sufficient.
Grafana: Grafana Cloud (free tier up to 10,000 metrics) or self-hosted on the same server as ClickHouse.
Total: $70–100/month for a production-ready pipeline. Compare that to SaaS email analytics: Litmus, Validity, Everest charge $500–2,000/month for a comparable set of metrics. Building your own is cheaper and more flexible — but it takes engineering time to set up and maintain.
Common mistakes
Single-row inserts. ClickHouse creates a new part on every INSERT. A thousand single-row inserts per second generates a thousand parts, background merges fall behind, and the server dies with "too many parts." Batch your writes. Minimum 1,000 rows per INSERT.
No buffer. The webhook receiver writes directly to ClickHouse. Your ESP sends 5,000 events in a second — normal behavior when sending 100K+ messages. ClickHouse cannot absorb it fast enough, events are lost. A queue solves this: the receiver handles traffic quickly, and processing happens at its own pace.
Storing the raw payload long-term. Raw JSON from your ESP is useful for debugging, but takes up 80% of the space. Keep raw payloads in a separate table with a short TTL (7–14 days), or drop them entirely once the pipeline is stable.
Ignoring idempotency. ESPs retry webhooks on timeout. Without deduplication, your open rate inflates by 5–15%. The metrics are wrong, and decisions built on them will be wrong too.
Putting it all together
An email analytics pipeline is not complicated. Four components, each with one job. The webhook receiver accepts and normalizes events. The queue buffers them. The ETL worker enriches and batches. ClickHouse stores and aggregates. Grafana visualizes and alerts.
But the pipeline only answers "what happened." Making the metrics good is solved one step earlier. A clean list, validated addresses, no spam traps or disposable inboxes — those are the input conditions under which the pipeline shows a healthy picture. Without them, the dashboard is just a visualization of your problems.
A good dashboard will not improve your metrics. It will show you the truth. And the truth depends on which addresses you are sending to.
Before launching your pipeline, check the list. uChecker shows how many invalid addresses are in your list before they skew your stats.
