April 5, 2026 · 11 min read

How AI Picks the Best Send Time for Each Subscriber

Most ESPs sell Send Time Optimization as a toggle: flip it on, watch open rate climb 10%. What sits behind that toggle is a pipeline: event collection, feature engineering, model training, and inference across millions of addresses. This article breaks STO down to the algorithm and data-structure level, for people who want to know what's actually happening inside.

The problem in formal terms

You have a subscriber set U and a delivery window W, split into slots (usually hourly or half-hourly). For each user u you want the slot t* in W that maximizes open probability: t*(u) = argmax_{t ∈ W} P(open | u, t). It sounds trivial until you run into what real data looks like.
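As a minimal illustration of the objective, assume the probabilities P(open | u, t) are already known and stored per user. The function name and the toy numbers below are hypothetical; estimating those probabilities is the hard part and is what the rest of the article covers.

```python
from typing import Mapping, Sequence


def best_slots(open_prob: Mapping[str, Sequence[float]]) -> dict[str, int]:
    """Return t*(u): the index of the highest-probability slot for each user."""
    return {
        user: max(range(len(probs)), key=probs.__getitem__)
        for user, probs in open_prob.items()
    }


# Toy 4-slot window; the probabilities are made up for illustration.
toy = {"u1": [0.05, 0.20, 0.10, 0.02], "u2": [0.01, 0.03, 0.03, 0.30]}
result = best_slots(toy)
```

Ties resolve to the earliest slot here, which is an arbitrary choice.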

Input signals: what goes into the model

STO models train on historical engagement events. The minimum signal set used by most implementations:

Open and click timestamps — the primary source. From these you extract hour of day, day of week, and UTC offset. What matters is the user's local hour, not the server's.
Delivery-to-open latency. Message sent at 9:00, opened at 9:47 — latency 47 min. This feature beats absolute open time because it removes the dependency on send hour.
Device type — desktop vs. mobile. Mobile opens cluster in morning and evening; desktop opens track business hours.
Campaign category — promotional, transactional, trigger. Someone may open order notifications instantly but park promo email until evening.
Timezone — from the user's profile or inferred from the IP of the most recent interaction.

Advanced implementations add inbox-check frequency, purchase history, and activity segment. The five-feature baseline covers 80–90% of the usable signal.
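The two timestamp-derived features can be sketched in a few lines. This assumes timestamps are stored in UTC and the offset is in minutes, as in the schema described later; the function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def local_hour(event_ts_utc: datetime, tz_offset_min: int) -> int:
    """Hour of day (0-23) on the subscriber's clock, not the server's."""
    return (event_ts_utc + timedelta(minutes=tz_offset_min)).hour


def open_latency_min(sent_utc: datetime, opened_utc: datetime) -> float:
    """Delivery-to-open latency in minutes."""
    return (opened_utc - sent_utc).total_seconds() / 60


sent = datetime(2026, 4, 5, 7, 0, tzinfo=timezone.utc)
opened = datetime(2026, 4, 5, 7, 47, tzinfo=timezone.utc)
```

With a +120 minute offset the open above lands in local hour 9 with a latency of 47 minutes, matching the example in the list.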

Data model: storing engagement history

STO needs an events table with minute-level granularity. A typical schema in a columnar database (ClickHouse, BigQuery):

SQL schema

CREATE TABLE email_events (
  user_id       UInt64,
  campaign_id   UInt64,
  event_type    Enum('send','open','click'),
  event_ts      DateTime64(3),   -- UTC
  tz_offset     Int16,           -- minutes from UTC
  device_type   Enum('desktop','mobile','tablet'),
  campaign_type Enum('promo','transactional','trigger')
) ENGINE = MergeTree()
  ORDER BY (user_id, event_ts);

Store tz_offset at the event level, not the profile. A user's timezone changes with travel or relocation, and a profile-level value goes stale. Derive the offset from the open IP and write it next to the event.

Feature engineering aggregates a per-user profile: open distribution across 24 hourly bins over the last 90 days, median delivery-to-open latency, share of mobile opens. Ninety days is a compromise — shorter gives too few samples, longer and the patterns are stale.
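A rough sketch of that aggregation, assuming the opens have already been filtered to the last 90 days and converted to local hours. The OpenEvent type and the dictionary keys are hypothetical names, not part of any ESP's API.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class OpenEvent:
    local_hour: int      # 0-23, already shifted by tz_offset
    latency_min: float   # delivery-to-open latency
    is_mobile: bool


def build_profile(opens: list[OpenEvent]) -> dict:
    """Aggregate one user's opens into the per-user feature profile."""
    hist = [0] * 24
    for e in opens:
        hist[e.local_hour] += 1
    n = len(opens)
    return {
        "hour_hist": hist,
        "median_latency_min": median(e.latency_min for e in opens) if n else None,
        "mobile_share": sum(e.is_mobile for e in opens) / n if n else None,
    }
```

Returning None for users with no opens keeps "no data" distinct from "zero", which matters for the cold-start handling discussed further down.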

Three modeling approaches

In practice, STO comes in three architectures ordered from simple to complex, each with a different sweet spot.

1. Per-user histogram

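Build a histogram of opens across 24 hourly bins for each subscriber. The peak bin becomes the send candidate. Smoothing keeps a single stray bin from deciding the slot: kernel smoothing spreads each count over neighboring hours, and Laplace (additive) smoothing stabilizes the per-bin probability estimates when counts are small. Fewer than 5–10 opens? Fall back to a cohort profile averaged by segment.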

Pros: transparent, fast, zero inference cost. Cons: ignores interactions between features (day of week, campaign type), works poorly for new subscribers.
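One way the histogram approach could look, as a sketch rather than a reference implementation. It uses a small circular kernel (so 23:00 and 00:00 count as neighbors) and a cohort fallback; the kernel weights and the min_opens threshold are assumptions chosen for illustration.

```python
def smooth_circular(hist, kernel=(0.25, 0.5, 0.25)):
    """Weighted average of each bin with its neighbors, wrapping around midnight."""
    n, half = len(hist), len(kernel) // 2
    return [
        sum(w * hist[(i + k - half) % n] for k, w in enumerate(kernel))
        for i in range(n)
    ]


def pick_slot(user_hist, cohort_hist, min_opens=5):
    """Peak of the smoothed user histogram, or of the cohort's when data is sparse."""
    hist = user_hist if sum(user_hist) >= min_opens else cohort_hist
    smoothed = smooth_circular(hist)
    return max(range(len(smoothed)), key=smoothed.__getitem__)
```

Kernel smoothing is used here instead of Laplace smoothing because adding the same constant to every bin never changes which bin is the peak; spreading counts over neighbors does.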

2. Gradient boosting (classification per slot)

Frame it as binary classification: for every (user, slot) pair, predict open probability. Features are the user profile aggregates plus slot characteristics (hour, day of week, working/holiday). Train XGBoost or LightGBM on historical sends. At inference, score all 24 slots per user and return argmax.

Python

import lightgbm as lgb
import numpy as np

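# X_train / y_train are assumed to exist: one row per historical send
# features: user profile + slot features
# label: 1 if opened within 2h of delivery, else 0
# subsample only takes effect in LightGBM when subsample_freq > 0
model = lgb.LGBMClassifier(n_estimators=500, max_depth=6,
    learning_rate=0.05, subsample=0.8, subsample_freq=1,
    colsample_bytree=0.8)
model.fit(X_train, y_train)

def encode_slot(hour: int, dow: int) -> np.ndarray:
    # illustrative slot encoding: hour, day of week (0 = Monday), is_weekend
    return np.array([hour, dow, int(dow >= 5)], dtype=float)

def predict_best_slot(user_features: np.ndarray, dow: int) -> int:
    # build all 24 candidate rows and score them in one batched call
    X = np.stack([np.concatenate([user_features, encode_slot(hour, dow)])
                  for hour in range(24)])
    return int(np.argmax(model.predict_proba(X)[:, 1]))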

Brevo and Mailchimp reportedly use a variant of this approach, judging by their published technical descriptions. It scales cleanly: one model trains on all users, personalization comes through features. Inference over one million users (24 × 10⁶ pairs) runs in a few minutes on a single multi-core machine; LightGBM prediction runs on the CPU, so no GPU is needed.

3. Multi-armed bandit (Thompson Sampling)

Each hourly slot is a bandit arm. For each user, maintain Beta(a_t, b_t) per slot t, where a_t is the prior plus the open count for that slot and b_t the prior plus the non-open count. Starting from Beta(1, 1) gives a uniform prior. At send time, sample from each distribution and pick the slot with the highest draw.

The bandit balances exploitation (proven time) with exploration (detect behavioral shifts). New subscribers get natural exploration; users with history converge quickly.

Python

import numpy as np

class ThompsonSTO:
    def __init__(self, n_slots: int = 24):
        # Beta(1, 1) uniform prior per slot; observed counts are added on top
        self.alpha = np.ones(n_slots)   # 1 + opens per slot
        self.beta  = np.ones(n_slots)   # 1 + non-opens per slot

    def select_slot(self) -> int:
        # draw one sample per slot and send at the slot with the highest draw
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, slot: int, opened: bool) -> None:
        if opened:
            self.alpha[slot] += 1
        else:
            self.beta[slot] += 1

In production the bandit is often extended with contextual features — day of week, campaign type — giving the flexibility of boosting with the exploration behavior of Thompson Sampling.
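The simplest contextual variant keeps a separate set of Beta arms per discrete context. The sketch below assumes the context is a hashable tuple such as (campaign type, weekday or weekend); the class name and context encoding are illustrative. Production systems often use a parametric model instead of this lookup-table form.

```python
import random
from collections import defaultdict


class ContextualThompsonSTO:
    """One independent Thompson Sampling bandit per discrete context."""

    def __init__(self, n_slots: int = 24, seed=None):
        self.n_slots = n_slots
        self.rng = random.Random(seed)
        # context -> [alpha, beta] per slot, each starting at a uniform Beta(1, 1)
        self.params = defaultdict(
            lambda: [[1.0, 1.0] for _ in range(n_slots)]
        )

    def select_slot(self, context) -> int:
        draws = [self.rng.betavariate(a, b) for a, b in self.params[context]]
        return max(range(self.n_slots), key=draws.__getitem__)

    def update(self, context, slot: int, opened: bool) -> None:
        # index 0 holds alpha (opens), index 1 holds beta (non-opens)
        self.params[context][slot][0 if opened else 1] += 1
```

The trade-off is data fragmentation: every new context dimension multiplies the number of arms that need to be learned.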

Approach	Strength	Limitation
Histogram	Transparent, zero inference cost	Ignores context
Gradient boosting	Uses features, scales well	No exploration, can overfit
Thompson Sampling	Adaptive, explores naturally	Slow convergence without context

Cold start: no history, no problem (mostly)

A new subscriber has no opens, no history. Three strategies:

Cohort fallback. Group users by timezone and signup source. The cohort's aggregated profile becomes the starting point. For B2B lists, “business hours by local time” is already a reasonable guess.
Hierarchical prior. In the bandit approach, initialize the Beta distribution parameters from cohort aggregates: Beta(a_cohort, b_cohort). The bandit starts informed and adjusts as individual data arrives.
Exploration burst. For the first 3–5 campaigns, send at random slots (B2B: restrict to business hours). This generates training data at the cost of a slight open-rate dip early on.
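The hierarchical prior can be sketched as scaling each slot's cohort open rate into a small number of pseudo-observations. The strength parameter (how many sends the prior is worth) and the function name are assumptions; using raw cohort counts instead would make the prior so strong that individual data barely moves it.

```python
def cohort_prior(cohort_opens, cohort_sends, strength=10.0):
    """Per-slot (alpha, beta) pairs worth `strength` pseudo-sends each."""
    priors = []
    for opens, sends in zip(cohort_opens, cohort_sends):
        # With no cohort data for a slot, fall back to an uninformative 50/50
        rate = opens / sends if sends else 0.5
        # Keep both parameters strictly positive so the Beta stays well defined
        alpha = max(rate * strength, 1e-3)
        beta = max((1 - rate) * strength, 1e-3)
        priors.append((alpha, beta))
    return priors
```

A new subscriber's bandit would then start from these pairs instead of Beta(1, 1), and the individual's own opens take over after roughly `strength` sends per slot.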

Production architecture: from model to send

A trained model is useless without infrastructure that plugs predictions into the send pipeline:

Batch inference. Hours before the campaign, run predictions for all recipients. Output: a user_id → best_slot table. For a million-row list with 24 slots each, LightGBM finishes in 2–3 minutes.
Send queue. The scheduler splits recipients into 24 slot groups and enqueues jobs (Redis, RabbitMQ, Kafka). Each group fires at its assigned hour.
Throttling. Blasting 200,000 emails at exactly 10:00 will get you rate-limited or rejected. Within each slot, sends spread over 15–30 minutes.
Feedback loop. Open and click events write back into the events table. The model retrains on a schedule — weekly for boosting, continuously for the bandit.
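The send-queue and throttling steps above might look like the following sketch. It assumes a user_id to slot mapping from batch inference and a single reference clock; per-user timezone conversion and the actual queue client (Redis, RabbitMQ, Kafka) are left out, and the 20-minute spread is an arbitrary value inside the 15–30 minute range.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def schedule_sends(user_slots, campaign_date, spread_min=20):
    """Map user_id -> send time, spreading each slot's recipients evenly."""
    by_slot = defaultdict(list)
    for user_id, slot in user_slots.items():
        by_slot[slot].append(user_id)

    schedule = {}
    for slot, users in by_slot.items():
        start = campaign_date + timedelta(hours=slot)
        # Evenly spaced offsets so the whole group goes out within spread_min
        step = timedelta(minutes=spread_min) / len(users)
        for i, user_id in enumerate(sorted(users)):
            schedule[user_id] = start + i * step
    return schedule
```

In a real pipeline each (user_id, send time) pair would become a delayed job; a fixed step is the simplest throttle, and a rate limiter per receiving domain is a common refinement.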

pipeline

email_events (ClickHouse)
  │
  ▼
feature_engineering (scheduled job, daily)
  │
  ▼
model_training (weekly retrain / continuous bandit update)
  │
  ▼
batch_inference → user_slots table
  │
  ▼
campaign_scheduler → slot queues (Redis)
  │
  ▼
send_workers (throttled, per-slot)
  │
  ▼
ESP / MTA → delivery → webhook events → email_events

Measuring STO effect correctly

You cannot turn on STO and compare open rate to last month. Too many confounders: seasonality, content changes, list size. The right measurement is a user-level A/B test.

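The control group (10–20% of the list) receives email at a fixed time, say 10:00 local. The test group gets the model's prediction. Assign groups deterministically from user_id so each person always lands in the same bucket. A plain user_id % 100 does that, but sequential ids can correlate with signup date, so hashing the id with a per-experiment salt before taking the modulo is the safer choice.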
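A sketch of deterministic bucket assignment. It hashes the id together with an experiment name rather than using the raw id, on the assumption that sequential ids may correlate with signup date; the experiment label and the 10% control share are placeholders.

```python
import hashlib


def ab_group(user_id: int, experiment: str = "sto-v1", control_pct: int = 10) -> str:
    """Stable assignment: the same user and experiment always give the same group."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "control" if bucket < control_pct else "test"
```

Changing the experiment name reshuffles the buckets, so a later test does not reuse the same control users.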

Primary metric: open rate. Secondary: click rate, conversion rate, delivery-to-open latency. Watch slot distribution too — if the model collapses 80% of sends into two or three hours, it has overfit to a cohort pattern rather than personalizing.
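The slot-collapse check reduces to one number: the share of recipients assigned to the k busiest slots. The helper below is a hypothetical monitoring function, and what counts as an alarming value depends on the list.

```python
from collections import Counter


def top_k_share(assigned_slots, k=3):
    """Fraction of recipients whose assigned slot is among the k most common."""
    counts = Counter(assigned_slots)
    top = sum(count for _, count in counts.most_common(k))
    return top / len(assigned_slots)
```

Comparing this value against the same statistic for the raw open histogram helps separate a model that collapsed from an audience that really does open at the same hours.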

Realistic numbers

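On lists with clean history (valid addresses, accurate tracking), STO delivers +5–15% open rate over a fixed send time. On dirty lists the lift shrinks or disappears: the model trains on noise from dead addresses. Be explicit about whether a lift is relative or absolute. For a 500K list, a gain of 5 percentage points means 25,000 extra opens per campaign; a 5% relative gain on a 20% baseline open rate means about 5,000. Either way, nothing about the copy or offer has changed.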
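The arithmetic depends on whether a lift is read as relative or as percentage points. The sketch below works both readings; the 20% baseline open rate is an assumed figure for illustration, not a number from any benchmark.

```python
def extra_opens(list_size, baseline_rate, lift, relative=True):
    """Extra opens per campaign for a relative or absolute (points) lift."""
    if relative:
        new_rate = baseline_rate * (1 + lift)
    else:
        new_rate = baseline_rate + lift
    return round(list_size * (new_rate - baseline_rate))


relative_gain = extra_opens(500_000, 0.20, 0.05)                   # 20% -> 21%
absolute_gain = extra_opens(500_000, 0.20, 0.05, relative=False)   # 20% -> 25%
```

On a 500K list the relative reading gives 5,000 extra opens and the percentage-point reading gives 25,000, so it is worth stating which one a reported lift refers to.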

What breaks STO in practice

The model runs, the pipeline is wired up, open rate does not move. Four causes that come up most often:

Invalid addresses in the training set. A dead mailbox will not open at any hour. The model treats it as “subscriber who ignores every slot” and tries to find them a time anyway. Result: noise in training, degraded accuracy for active subscribers. Validate before you train.
Apple Mail Privacy Protection. Since iOS 15, Mail prefetches message content at delivery, not at open. The tracking pixel fires on delivery. The model learns on phantom opens. Filtering Apple Mail user-agents from the training set is a blunt fix, but it works.
Wrong timezone. If 30% of users have an incorrect timezone, the model trains on shifted data. IP-based geolocation is unreliable for VPN users. The cleanest source is a JavaScript timezone lookup at subscription time.
Send window too narrow. A business that restricts sends to 9:00–18:00 limits what the model can do. Within that window slot differences are small, and STO produces no measurable lift.

Data quality as the real foundation

All three approaches share one assumption: the events table reflects real actions by live people. When that breaks — invalid addresses, spam traps, disposable mailboxes — the model loses its signal.

In a list of 200,000 addresses, 15% invalid means 30,000 recipients whose training rows are all zeros: no opens in any slot. The model cannot tell a dead mailbox from a disengaged subscriber, so it starts fitting patterns that do not exist. Accuracy drops for everyone.
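As a pre-training step, this amounts to filtering rows against the output of a validation pass. The sketch assumes training rows are dictionaries with a user_id key and that a set of validated ids is available from whatever validation tool is in use; both are hypothetical shapes.

```python
def drop_invalid(rows, valid_user_ids):
    """Keep rows for validated addresses; also report the share of rows removed."""
    kept = [row for row in rows if row["user_id"] in valid_user_ids]
    removed_share = 1 - len(kept) / len(rows) if rows else 0.0
    return kept, removed_share
```

Logging the removed share on every retrain gives an early warning when list quality starts to drift.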

Validating your list before building STO is not a marketing best practice — it is an engineering requirement, the same data-cleaning step you apply before training any ML model. Garbage in, garbage out.

STO optimizes the delivery moment. If the address does not exist, there is nothing to optimize. A clean list is a precondition for any model that trains on engagement events.

ESP-provided STO vs. building your own

If you use Mailchimp, Brevo, HubSpot, or Salesforce Marketing Cloud, STO is already built in as a toggle in campaign settings. These models typically train on aggregated data from across the platform's customers, which helps with cold start: even a new account gets predictions anchored to collective engagement patterns.

Building your own makes sense under two conditions: the list exceeds 500K subscribers (enough data for an individual model) and there is a team to maintain the pipeline — retraining, monitoring, A/B tests. For everyone else, the ESP's STO is the rational choice.

Implementation checklist

Validate the list. Remove invalid addresses, spam traps, and disposable mailboxes. Without this step the model trains on noise.
Check tracking. Opens should log with UTC timestamps. Filter Apple Mail prefetch. Confirm delivery-to-open latency calculates correctly.
Resolve timezones. IP geolocation at open time or JavaScript at subscribe time. Store the offset at the event level.
Pick an approach. For a start: histogram or the built-in ESP STO. For more control: gradient boosting or contextual bandit.
Run a proper A/B test. Control group 10–20% on a fixed time. Run it across at least 3–4 campaigns, and confirm the difference is statistically significant at your sample size before calling it a win.
Monitor slot distribution. Watch for collapse into a few hours (overfitting), track open rate by group, track regret for the bandit.
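For the A/B step in the checklist, a two-proportion z-test is one common way to check whether the open-rate difference between control and test is larger than chance. This is a sketch using the pooled normal approximation, which assumes reasonably large groups; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(opens_a, n_a, opens_b, n_b):
    """Return (z, two-sided p-value) for the difference in open rates, B minus A."""
    p_a, p_b = opens_a / n_a, opens_b / n_b
    pooled = (opens_a + opens_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

Pooling opens and sends across several campaigns before testing is simple but treats every send as independent; a per-user analysis is stricter when the same subscribers appear in each campaign.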

Bottom line

Send Time Optimization is a standard ML problem: collect data, engineer features, train, infer, close the feedback loop. The biggest gains come from clean data and accurate tracking. The first step is always list validation.

Before enabling STO, confirm the model will train on clean data. Validate your list in uChecker — address validation, risk scoring, spam-trap and disposable-address filtering.

Tags: send time optimization, STO email, email send time AI, machine learning email, STO algorithms, Thompson Sampling, email validation
