uCheckeruChecker
15 min read

Email analytics data pipeline: from webhook to dashboard

Your ESP fires webhooks for every event: deliveries, opens, clicks, bounces, unsubscribes. Thousands per minute. Most teams dump them into PostgreSQL and then complain about a slow dashboard. The dashboard is not the problem. The architecture is. This article walks through a concrete pipeline design that handles millions of events and returns charts in milliseconds.


Overall architecture: four components

The pipeline has four stages. Each one does a single job, so you can swap any stage without breaking the others.

1. Webhook receiver. Accepts HTTP POST requests from your ESP (SendGrid, Mailgun, Postmark, Amazon SES). Validates the signature, normalizes the event format, and publishes to a queue.

2. Message queue. Kafka, RabbitMQ, or Redis Streams. A buffer between ingestion and processing. If ClickHouse goes down for a moment, events are not lost.

3. ETL layer. Reads events from the queue, enriches them (attaches campaign_id, resolves geo from IP, computes time-to-open), and batch-inserts into storage.

4. Storage and visualization. ClickHouse for analytical queries, Grafana for dashboards. Alternatives exist: BigQuery + Looker Studio, or TimescaleDB + Metabase. But for the combination of speed, cost, and flexibility, ClickHouse + Grafana wins at most data volumes.

ESP (SendGrid / Mailgun / SES)
  │
  ▼  HTTP POST (webhook)
┌──────────────────┐
│ Webhook Receiver │  ← signature validation, normalization
└──────┬───────────┘
       │
       ▼  publish
┌──────────────────┐
│  Kafka / Redis   │  ← buffer, delivery guarantee
└──────┬───────────┘
       │
       ▼  consume + transform
┌──────────────────┐
│   ETL Worker     │  ← enrichment, batch insert
└──────┬───────────┘
       │
       ▼  INSERT batch
┌──────────────────┐
│   ClickHouse     │  ← columnar storage
└──────┬───────────┘
       │
       ▼  SQL
┌──────────────────┐
│    Grafana       │  ← dashboards, alerts
└──────────────────┘

Let's go through each component with code.

Webhook receiver: ingestion and validation

Every ESP sends webhooks in its own format. SendGrid batches events into a single POST. Mailgun sends form-encoded data with an HMAC signature. Amazon SES goes through SNS and requires a subscription confirmation step. The receiver's first job is to reduce all of that to a single schema.

Minimum structure for a normalized event:

interface EmailEvent {
  event_id: string;        // idempotency key
  event_type: string;      // delivered | opened | clicked | bounced | complained | unsubscribed
  email: string;           // recipient address
  campaign_id: string;     // campaign identifier
  timestamp: number;       // unix timestamp
  esp: string;             // sendgrid | mailgun | ses
  ip?: string;             // recipient IP (for geo)
  user_agent?: string;     // for client detection
  url?: string;            // for click events
  bounce_type?: string;    // hard | soft
  raw: object;             // original payload for debugging
}

The receiver runs on Node.js with Fastify. Why Fastify over Express: under webhook load, Fastify handles 2–3x more requests per second per core in benchmarks. For an endpoint receiving thousands of requests per minute, that gap matters.

import Fastify from 'fastify';
import crypto from 'crypto';
import { Kafka } from 'kafkajs';

const app = Fastify({ logger: true });
const kafka = new Kafka({ brokers: ['kafka:9092'] });
const producer = kafka.producer();
await producer.connect();

// SendGrid signature verification
function verifySendGridSignature(publicKey, payload, signature, timestamp) {
  const data = timestamp + JSON.stringify(payload);
  const verifier = crypto.createVerify('SHA256');
  verifier.update(data);
  return verifier.verify(publicKey, signature, 'base64');
}

// SendGrid event normalization
function normalizeSendGrid(events) {
  return events.map(e => ({
    event_id: e.sg_event_id,
    event_type: mapEventType(e.event),  // bounce → bounced, open → opened
    email: e.email,
    campaign_id: e.marketing_campaign_id || e.asm_group_id || 'unknown',
    timestamp: e.timestamp,
    esp: 'sendgrid',
    ip: e.ip,
    user_agent: e.useragent,
    url: e.url,
    bounce_type: e.type === 'bounce' ? (e.status?.startsWith('5') ? 'hard' : 'soft') : undefined,
    raw: e,
  }));
}

app.post('/webhooks/sendgrid', async (req, reply) => {
  // 1. Verify signature
  const signature = req.headers['x-twilio-email-event-webhook-signature'];
  const timestamp = req.headers['x-twilio-email-event-webhook-timestamp'];
  if (!verifySendGridSignature(process.env.SG_WEBHOOK_KEY, req.body, signature, timestamp)) {
    return reply.code(401).send({ error: 'invalid signature' });
  }

  // 2. Normalize
  const events = normalizeSendGrid(req.body);

  // 3. Publish to Kafka
  await producer.send({
    topic: 'email-events',
    messages: events.map(e => ({
      key: e.email,            // partition by email address
      value: JSON.stringify(e),
    })),
  });

  return reply.code(200).send({ accepted: events.length });
});

app.listen({ port: 3100, host: '0.0.0.0' });

Three things worth spelling out.

Signature verification is not optional. Without it, anyone can POST to your endpoint and pollute your data. Or worse, trigger business logic with fabricated events. Every ESP provides a signing mechanism: SendGrid uses ECDSA, Mailgun uses HMAC-SHA256, SES uses SNS message signing. Do not skip this step.

Idempotency via event_id. ESPs retry webhook delivery on timeout. Without deduplication, one open becomes three. The event_id field lets you filter duplicates in the ETL stage.

Partition by email address. All events for one recipient land in the same Kafka partition. That preserves ordering: delivered always before opened, opened before clicked. Without partitioning, events arrive out of sequence and your funnel breaks.

ETL: enrichment and batch writes

The consumer reads events from Kafka, adds metadata, and inserts into ClickHouse in batches. Batching is not optional: ClickHouse is designed for large INSERTs. Single-row inserts create one part per insert, the background merge cannot keep up, and the server crashes with "too many parts." Minimum batch: 1,000 rows or every 5 seconds, whichever comes first.

import { Kafka } from 'kafkajs';
import { createClient } from '@clickhouse/client';
import geoip from 'geoip-lite';

const kafka = new Kafka({ brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'etl-email-events' });
const ch = createClient({ host: 'http://clickhouse:8123' });

const BATCH_SIZE = 1000;
const FLUSH_INTERVAL_MS = 5000;
let buffer = [];
let flushTimer = null;

async function flush() {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length);

  await ch.insert({
    table: 'email_events',
    values: batch,
    format: 'JSONEachRow',
  });

  console.log(`Inserted ${batch.length} events into ClickHouse`);
}

function scheduleFlush() {
  if (flushTimer) clearTimeout(flushTimer);
  flushTimer = setTimeout(flush, FLUSH_INTERVAL_MS);
}

function enrichEvent(event) {
  // Geo enrichment from IP
  const geo = event.ip ? geoip.lookup(event.ip) : null;

  return {
    ...event,
    country: geo?.country || 'unknown',
    city: geo?.city || 'unknown',
    // Seconds from send to open
    time_to_open: event.event_type === 'opened' && event.sent_at
      ? event.timestamp - event.sent_at
      : null,
    // Day of week and hour for send-time optimization analysis
    day_of_week: new Date(event.timestamp * 1000).getUTCDay(),
    hour_of_day: new Date(event.timestamp * 1000).getUTCHours(),
    // Date for table partitioning
    event_date: new Date(event.timestamp * 1000).toISOString().split('T')[0],
  };
}

await consumer.connect();
await consumer.subscribe({ topic: 'email-events', fromBeginning: false });

await consumer.run({
  eachMessage: async ({ message }) => {
    const event = JSON.parse(message.value.toString());
    const enriched = enrichEvent(event);
    buffer.push(enriched);

    if (buffer.length >= BATCH_SIZE) {
      await flush();
    } else {
      scheduleFlush();
    }
  },
});

Enrichment adds three categories of data. Geo from IP, for regional reports. Time metrics (time_to_open, day_of_week, hour_of_day), for engagement analysis and send-time optimization. Event date (event_date), for ClickHouse table partitioning.

Deduplication by event_id happens at the ClickHouse level. ReplacingMergeTree with event_id in the sort key removes duplicates during background merges. That is more reliable than filtering in the application: if the ETL worker restarts and reprocesses some messages, ClickHouse handles it automatically.

ClickHouse: table schema

ClickHouse is a columnar database built for analytical queries. Aggregating millions of rows across a few columns is its normal workload. A query for open rate by campaign over the last month runs in tens of milliseconds, not seconds.

CREATE TABLE email_events (
    event_id       String,
    event_type     LowCardinality(String),  -- delivered, opened, clicked, bounced, ...
    email          String,
    campaign_id    String,
    timestamp      UInt32,
    esp            LowCardinality(String),
    ip             Nullable(String),
    user_agent     Nullable(String),
    url            Nullable(String),
    bounce_type    Nullable(String),
    country        LowCardinality(String),
    city           String,
    time_to_open   Nullable(UInt32),
    day_of_week    UInt8,
    hour_of_day    UInt8,
    event_date     Date
)
ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (campaign_id, event_type, email, event_id)
TTL event_date + INTERVAL 12 MONTH
SETTINGS index_granularity = 8192;

A few decisions baked into this schema.

LowCardinality for columns with few distinct values (event_type, esp, country). ClickHouse stores them as dictionary-encoded columns, which compresses data 10–20x and speeds up filtering.

ORDER BY (campaign_id, event_type, email, event_id). The key order matches the most common queries: all events for campaign X, all opens for campaign X, all events for user Y in campaign X. ClickHouse stores data sorted by these columns, so filtering on them runs in O(log n) with no extra indexes.

ReplacingMergeTree with event_id in the sort key. Duplicates are removed during background merges. For analytics that tolerate a small degree of approximation, this is enough. If you need exact deduplication at read time, add FINAL to your SELECT or wrap it in a subquery with argMax.

TTL 12 months. Data older than a year is deleted automatically. Email analytics from last year rarely drives operational decisions. If you need longer retention, raise the limit or configure tiering to S3.

Materialized views: pre-aggregation

A dashboard that scans hundreds of millions of rows on every load will eventually become slow. ClickHouse materialized views solve this by aggregating data at insert time and storing the result in a separate table. Dashboard queries read thousands of rows instead of millions.

-- Aggregate: daily campaign metrics
CREATE MATERIALIZED VIEW campaign_daily_mv
TO campaign_daily_stats
AS SELECT
    event_date,
    campaign_id,
    countIf(event_type = 'delivered')    AS delivered,
    countIf(event_type = 'opened')       AS opened,
    countIf(event_type = 'clicked')      AS clicked,
    countIf(event_type = 'bounced')      AS bounced,
    countIf(event_type = 'complained')   AS complained,
    countIf(event_type = 'unsubscribed') AS unsubscribed,
    -- Rates
    round(countIf(event_type = 'opened') / countIf(event_type = 'delivered') * 100, 2)
        AS open_rate,
    round(countIf(event_type = 'clicked') / countIf(event_type = 'opened') * 100, 2)
        AS click_rate,
    round(countIf(event_type = 'bounced') / (countIf(event_type = 'delivered') + countIf(event_type = 'bounced')) * 100, 2)
        AS bounce_rate
FROM email_events
GROUP BY event_date, campaign_id;

-- Target table for the aggregate
CREATE TABLE campaign_daily_stats (
    event_date   Date,
    campaign_id  String,
    delivered    UInt64,
    opened       UInt64,
    clicked      UInt64,
    bounced      UInt64,
    complained   UInt64,
    unsubscribed UInt64,
    open_rate    Float32,
    click_rate   Float32,
    bounce_rate  Float32
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, campaign_id);

SummingMergeTree sums numeric columns when merging rows with the same key. Each batch insert into email_events automatically updates campaign_daily_stats. Grafana reads that table instead of the main one and responds instantly.

For more complex aggregates (unique recipients, median time_to_open), use AggregatingMergeTree with uniqState and quantileState functions. More setup, but you get exact unique counts without a full table scan.

Grafana: a dashboard in 30 minutes

Grafana connects to ClickHouse via the official plugin (grafana-clickhouse-datasource). After that, each panel just runs a SQL query.

The minimum useful panels for an email dashboard:

Open rate over time — a line chart over the last 30 days. A drop of 5+ percentage points signals a deliverability problem.

Bounce rate by campaign — a sortable table. Campaigns above 3% bounce rate need attention: the segment likely contains stale addresses.

Open heatmap — hour_of_day vs day_of_week. Shows when subscribers actually open. Use this for send-time optimization.

Campaign funnel — delivered / opened / clicked. Quick diagnosis: high delivered but low opened means the subject line is the problem. Normal opens but low clicks means the content or CTA.

-- Query for "Open rate over time" panel
SELECT
    event_date AS time,
    round(open_rate, 1) AS "Open Rate %"
FROM campaign_daily_stats
WHERE
    campaign_id = '$campaign'        -- Grafana variable
    AND event_date >= today() - 30
ORDER BY event_date;

-- Query for the open heatmap
SELECT
    day_of_week,
    hour_of_day,
    count() AS opens
FROM email_events
WHERE
    event_type = 'opened'
    AND event_date >= today() - 90
GROUP BY day_of_week, hour_of_day
ORDER BY day_of_week, hour_of_day;

Grafana supports alerts. Set one for bounce rate above 5% over the last 24 hours — that is the threshold where ESPs start throttling sends. Early warning gives you time to pause the campaign and investigate.

Where email validation fits in

The pipeline shows what happened. Validation determines what will happen. If invalid addresses get into a campaign, the pipeline will honestly report 15% bounce rate and a falling open rate. By then the damage is done: domain reputation is down, your ESP has deprioritized your traffic, and you spent money sending to nowhere.

Validation runs before the pipeline — at the segment-building stage. Before launching a campaign, run the list through a check. Addresses with high risk (non-existent, disposable, spam traps) are removed upfront. The pipeline gets clean events, metrics reflect reality, not noise from dead addresses.

Automation is straightforward: add a validation step to your mailing CI/CD, or to a cron job that checks the database once a week. The API takes a list and returns a status for each address. Addresses with status "undeliverable" are excluded from the next campaign.

# Example: validate a list before sending
import requests

API_KEY = "your_api_key"
API_URL = "https://api.uchecker.net/api/v1/validate/single"

def validate_email(email: str) -> dict:
    resp = requests.post(
        API_URL,
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
        json={"email": email},
    )
    return resp.json()

def clean_segment(emails: list[str]) -> list[str]:
    clean = []
    for email in emails:
        result = validate_email(email)
        if result.get("result") == "deliverable":
            clean.append(email)
        else:
            print(f"Excluded: {email} — {result.get('result')}")
    return clean

# For large lists use the bulk API instead of single
# POST /api/v1/validate/bulk

The result: bounce rate stays below 2%, the dashboard shows clean metrics, and your ESP does not penalize the domain.

Deployment: what it costs and where to run it

For up to 500,000 subscribers and 3–5 campaigns per week — roughly 10–20 million events per month — a modest setup is enough:

Webhook receiver + ETL worker: one VPS with 2 vCPU, 4 GB RAM. Two Node.js processes. $15–20/month.

Kafka: managed Kafka (Upstash, Confluent Cloud) on the free or entry tier. Or Redis Streams on the same server if you want to keep things simple.

ClickHouse: ClickHouse Cloud (from $50/month), Altinity Cloud, or self-hosted on a VPS with 4 vCPU and 16 GB RAM. For 20 million rows per month with a 12-month TTL (around 240 million rows at peak), 16 GB is sufficient.

Grafana: Grafana Cloud (free tier up to 10,000 metrics) or self-hosted on the same server as ClickHouse.

Total: $70–100/month for a production-ready pipeline. Compare that to SaaS email analytics: Litmus, Validity, Everest charge $500–2,000/month for a comparable set of metrics. Building your own is cheaper and more flexible — but it takes engineering time to set up and maintain.

Common mistakes

Single-row inserts. ClickHouse creates a new part on every INSERT. A thousand single-row inserts per second generates a thousand parts, background merges fall behind, and the server dies with "too many parts." Batch your writes. Minimum 1,000 rows per INSERT.

No buffer. The webhook receiver writes directly to ClickHouse. Your ESP sends 5,000 events in a second — normal behavior when sending 100K+ messages. ClickHouse cannot absorb it fast enough, events are lost. A queue solves this: the receiver handles traffic quickly, and processing happens at its own pace.

Storing the raw payload long-term. Raw JSON from your ESP is useful for debugging, but takes up 80% of the space. Keep raw payloads in a separate table with a short TTL (7–14 days), or drop them entirely once the pipeline is stable.

Ignoring idempotency. ESPs retry webhooks on timeout. Without deduplication, your open rate inflates by 5–15%. The metrics are wrong, and decisions built on them will be wrong too.

Putting it all together

An email analytics pipeline is not complicated. Four components, each with one job. The webhook receiver accepts and normalizes events. The queue buffers them. The ETL worker enriches and batches. ClickHouse stores and aggregates. Grafana visualizes and alerts.

But the pipeline only answers "what happened." Making the metrics good is solved one step earlier. A clean list, validated addresses, no spam traps or disposable inboxes — those are the input conditions under which the pipeline shows a healthy picture. Without them, the dashboard is just a visualization of your problems.

A good dashboard will not improve your metrics. It will show you the truth. And the truth depends on which addresses you are sending to.

Before launching your pipeline, check the list. uChecker shows how many invalid addresses are in your list before they skew your stats.

email analytics pipelineemail data pipelineClickHouse emailwebhook ESPemail events ETLGrafana email dashboardemail deliverability metricsbatch insert ClickHouseemail validation