Implementing The Smtp Protocol: Mail Transfer Agent With Queue And Retry Logic

2026-05-21 · Leonardo Benicio

Summary

A comprehensive technical exploration of implementing the smtp protocol: mail transfer agent with queue and retry logic, covering key concepts, practical implementations, and real-world applications.

Contents

Introduction: Implementing the SMTP Protocol – A Mail Transfer Agent with Queue and Retry Logic

The Invisible Miracle of Email Delivery

Every day, over 300 billion emails traverse the globe. You hit “send” on a message, and within seconds – sometimes milliseconds – it appears in your recipient’s inbox, possibly halfway across the world. It feels like magic. But if you lift the hood, you’ll find a surprisingly resilient, old-school protocol that has been the backbone of electronic mail since the early 1980s: SMTP – the Simple Mail Transfer Protocol.

SMTP is anything but simple when you try to implement it robustly. Its simplicity as a command‑driven ASCII protocol belies the complexity of making email reliable. Lost connections, temporary server outages, rate limiting, greylisting, and spam filters—every link in the chain can fail. Yet email delivery somehow succeeds, time after time. The secret lies in queuing and retry logic, a system of deferred responsibility that underpins every serious Mail Transfer Agent (MTA) like Postfix, Sendmail, or Exim.

This blog post will take you deep into implementing your own SMTP‑based MTA with a production‑grade queue and retry engine. We’ll begin with a raw understanding of SMTP, then build a resilient delivery pipeline step by step. Along the way, you’ll learn why naively sending emails directly to MX servers is a recipe for disaster, and how exponential backoff, persistent queues, and graceful error handling turn a fragile prototype into a reliable workhorse.

Why Bother Building Your Own MTA?

Modern software developers rarely need to implement SMTP from scratch. High‑quality libraries and cloud services (SendGrid, Mailgun, SES) abstract away the pain. Yet understanding how an MTA works internally is invaluable for anyone who touches distributed systems, reliability engineering, or low‑level networking. It teaches you:

How to design idempotent, retry‑safe operations – email delivery must be at‑most‑once or at‑least‑once, and how you handle duplicates matters.
How to build a persistent queue – durable storage for tasks that may take hours or days to complete.
How to implement exponential backoff with jitter – the difference between a polite retry and a thundering herd.
How to manage concurrency and rate limiting – respecting upstream server limits without overwhelming your own resources.
How to handle partial failures – one recipient might succeed, another may bounce, all within a single transaction.

Beyond these technical lessons, building an MTA gives you a deep appreciation for the unsung reliability of email. It is a textbook example of a fault‑tolerant distributed system, and implementing one yourself turns abstract concepts into concrete experience.

In this post, we will walk through the full implementation of a minimal but production‑ready MTA. We’ll start with the SMTP protocol itself: how to connect to a remote Mail eXchange (MX) server, negotiate capabilities, issue commands, and gracefully close. Then we’ll design a queue that persists messages to disk, supports dequeue operations, and stores delivery metadata. We’ll layer retry logic on top, using a state machine to decide when to retry, when to bounce, and when to give up. Finally, we’ll discuss advanced topics like delivery status notifications, handling of temporary vs. permanent failures, and monitoring.

All code examples will be in Python 3, using only the standard library (plus aiosmtpd for a mock listening server). But the concepts apply directly to any language. By the end, you’ll have a working MTA capable of routing real email across the internet, and more importantly, a deep understanding of the engineering that makes the invisible miracle possible.

Part 1: Understanding the SMTP Protocol

A Brief History

SMTP was first defined in RFC 821 in 1982, and later updated by RFC 2821 and then RFC 5321, which is the current standard. Its design is intentionally simple: a client opens a TCP connection to port 25 of a mail server, sends ASCII commands, and receives numeric response codes. The conversation is line‑oriented, human‑readable, and easy to debug with a telnet client.

Despite its age, SMTP has survived because of its extensibility. The EHLO command allows servers to advertise capabilities (STARTTLS, Pipelining, 8BITMIME, SMTPUTF8, etc.), and the protocol itself can be extended with new service extensions. This flexibility, combined with the fact that email infrastructure is one of the most backward‑compatible systems ever built, keeps SMTP at the heart of mail delivery.

The Delivery Flow: From Submission to Inbox

Understanding the full path of an email helps frame the MTA’s job:

User Agent (MUA) – Outlook, Gmail, Thunderbird – composes an email and submits it to a Mail Submission Agent (MSA) on port 587 (or 465 for implicit TLS).
The MSA accepts the message, adds headers (Date, Message-ID, Received), and hands it to a Mail Transfer Agent (MTA).
The MTA looks up the recipient’s domain to find the MX records (Mail Exchanger servers).
The MTA opens an SMTP connection to one of the MX servers (trying them in priority order) and delivers the message.
The receiving MTA may perform spam checks, verify domain reputation, and either deliver the message to the local mailbox or forward it to another MTA.
Eventually, the final MTA hands the message to a Mail Delivery Agent (MDA) (e.g., dovecot, procmail) which writes it to the recipient’s mailbox.

Our focus will be on steps 3 and 4: the MTA that accepts outbound messages from an MSA and delivers them to remote MX servers. This MTA is responsible for queuing, retrying, and reporting failures.

SMTP Command‑Response Structure

An SMTP session consists of a dialog between client and server, each line starting with a command (4 characters, e.g., HELO, MAIL, RCPT, DATA) or a response (3‑digit code followed by text). A typical delivery transcript looks like this:

S: 220 mail.example.com ESMTP Postfix
C: EHLO mymta.local
S: 250-mail.example.com
S: 250-PIPELINING
S: 250-SIZE 10240000
S: 250-VRFY
S: 250-ETRN
S: 250-STARTTLS
S: 250-ENHANCEDSTATUSCODES
S: 250-8BITMIME
S: 250-DSN
S: 250 SMTPUTF8
C: MAIL FROM:<sender@example.com>
S: 250 2.1.0 Ok
C: RCPT TO:<recipient@otherdomain.com>
S: 250 2.1.5 Ok
C: DATA
S: 354 End data with <CRLF>.<CRLF>
C: (message content, headers, blank line, body)
C: .
S: 250 2.0.0 Ok: queued as 12345
C: QUIT
S: 221 2.0.0 Bye

Each response code indicates success (2xx), temporary failure (4xx), or permanent failure (5xx). The MTA must interpret these codes to decide whether to retry later or bounce the mail.

A Simple SMTP Client in Python

Let’s implement a minimal SMTP client that can deliver a single message to a given MX server. We’ll use the standard library’s smtplib for simplicity, but later we’ll need more control over timeouts and retries.

import smtplib

def deliver_via_mx(message: str, sender: str, recipients: list[str], mx_host: str, port: int = 25) -> bool:
    try:
        server = smtplib.SMTP(mx_host, port, timeout=10)
        server.ehlo()
        # Optionally STARTTLS
        if server.has_extn('STARTTLS'):
            server.starttls()
            server.ehlo()
        server.sendmail(sender, recipients, message)
        server.quit()
        return True
    except smtplib.SMTPRecipientsRefused as e:
        # Permanent failure for some recipients
        print(f"Permanent failure: {e}")
        raise
    except smtplib.SMTPServerDisconnected as e:
        # Temporary failure – should retry
        print(f"Connection dropped: {e}")
        raise
    except Exception as e:
        print(f"Unexpected error: {e}")
        raise

This function works, but it’s too naive. If the MX server returns a 4xx error after RCPT TO, smtplib raises an exception, but we lose the distinction between temporary and permanent. Worse, if the connection fails, we have no idea if the message was partially delivered. That’s why we need our own queue and a more careful state machine.

Part 2: Designing the Queue

Why a Queue?

Email delivery is asynchronous by nature. When you submit a message, the MTA cannot guarantee immediate delivery. The remote server may be temporarily unreachable, rate‑limited, or greylisting. Instead of blocking the sender, the MTA queues the message and attempts delivery later. This decoupling is essential for reliability.

A good message queue for an MTA should have:

Persistence: If the MTA crashes, messages must survive.
At‑least‑once delivery: A message should be delivered at least once (duplicates are better than loss).
Idempotency support: The delivery logic should be able to handle duplicate attempts without corrupting state.
Priority ordering: Some messages (like delivery failure notifications) may need expedited delivery.
Retry metadata: Store the number of attempts, last attempt time, next scheduled time, and transient response from server.
Atomic operations: Dequeue and update status should be atomic to avoid lost messages.

For our implementation, we’ll use a simple SQLite database as the queue. SQLite is portable, embedded, and supports ACID transactions. In a production system, you might use PostgreSQL or a dedicated message queue like RabbitMQ, but SQLite is perfect for learning.

Data Model

We’ll define two tables: messages (the raw email content plus envelope information) and delivery_attempts (per‑recipient retry state). However, since SMTP delivery is per‑envelope (MAIL FROM and RCPT TO), and a single message can have multiple recipients, it’s common to group recipients by domain and deliver in a single SMTP session. For simplicity, we’ll store one row per message‑recipient pair.

CREATE TABLE queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    sender TEXT NOT NULL,
    recipient TEXT NOT NULL,
    message_bytes BLOB NOT NULL,  -- raw RFC5322 message
    domain TEXT NOT NULL,          -- extracted from recipient
    created_at REAL NOT NULL,      -- Unix timestamp
    next_attempt REAL,             -- when to attempt delivery
    attempt_count INTEGER DEFAULT 0,
    max_attempts INTEGER DEFAULT 10,
    status TEXT DEFAULT 'pending', -- pending, delivering, success, permanent_failure, expired
    last_response TEXT,             -- last SMTP response message
    delivered_at REAL,              -- timestamp if successful
    UNIQUE(sender, recipient, message_bytes)  -- prevent duplicate enqueues
);

We store the message as raw bytes (including headers) to avoid encoding issues later. The domain field is extracted from recipient (after @) to group deliveries by MX.

Enqueueing a Message

When the MSA submits a message, the MTA parses the envelope and enqueues each recipient.

import sqlite3
import time
import re

def enqueue(sender: str, recipient: str, message: bytes, db_path: str = "queue.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    try:
        domain = recipient.split('@')[1]
        conn.execute("""
            INSERT OR IGNORE INTO queue (sender, recipient, message_bytes, domain, created_at, next_attempt)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (sender, recipient, message, domain, time.time(), time.time()))
        conn.commit()
    finally:
        conn.close()

We use INSERT OR IGNORE to avoid duplicate entries for the same sender/recipient/message content. In production, you might use a Message‑ID header as the deduplication key.

Dequeueing for Delivery

The delivery worker needs to fetch the next eligible message (where status == 'pending' and next_attempt <= now). To prevent multiple workers from grabbing the same message, we mark it as 'delivering' atomically:

def claim_next_message(db_conn) -> dict | None:
    cursor = db_conn.execute("""
        UPDATE queue SET status = 'delivering', last_response = NULL
        WHERE id="i7eb739af5f"
            SELECT id FROM queue
            WHERE status = 'pending' AND next_attempt <= ?
            ORDER BY next_attempt ASC
            LIMIT 1
        )
        RETURNING *
    """, (time.time(),))
    row = cursor.fetchone()
    if row:
        return {
            "id": row[0],
            "sender": row[1],
            "recipient": row[2],
            "message": row[3],
            "domain": row[4],
            "attempt_count": row[6],
        }
    return None

This uses UPDATE ... RETURNING to atomically claim the row and return its data. If the update fails due to concurrent access, the transaction rolls back safely.

Part 3: MX Resolution and Server Selection

Before we can deliver, we must find which server to connect to. The MTA looks up MX records for the recipient’s domain. MX records have a priority number; lower numbers are preferred. The client must try servers in order by priority, and if a server with a given priority fails, it should try another server of the same priority (or a higher priority) before giving up.

DNS Lookup in Python

We can use socket.getaddrinfo for A/AAAA records, but SMTP requires MX lookup. Python’s dnspython library is the most robust, but to avoid external dependencies, we can use the standard resolver from socket or call nslookup? Actually, the standard library does not include an MX query function. We’ll write a simple fallback using subprocess or use dnspython (which is often pre‑installed). For this post, we’ll assume dnspython is available (install with pip install dnspython).

import dns.resolver

def resolve_mx(domain: str) -> list[tuple[int, str]]:
    try:
        answers = dns.resolver.resolve(domain, 'MX')
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        # If no MX, try A/AAAA (as per RFC 5321)
        return [(0, domain)]  # treat domain itself as MX
    except dns.exception.DNSException as e:
        # Temporary DNS failure -> should retry
        raise RetryableError(f"DNS failure for {domain}: {e}")

    mx_records = [(answer.preference, str(answer.exchange).rstrip('.')) for answer in answers]
    mx_records.sort(key=lambda x: x[0])  # lower number = higher priority
    return mx_records

Server Selection Algorithm

After we have a sorted list, we iterate over groups of same‑priority servers. For each server, we attempt the SMTP connection. If it fails with a temporary error, we move to the next server in the same priority. If all servers in a priority level fail, we move to the next priority. If all servers fail, the message gets a temporary failure and is requeued.

In practice, you need to respect DNS TTLs to avoid hammering the resolver, and you should cache MX records. For simplicity, we’ll look afresh each time.

Part 4: Delivery Worker and Retry Logic

The Delivery Loop

The MTA runs one or more worker threads (or asyncio tasks) that continuously:

Claim the next message from the queue (with status='pending').
Resolve MX for the recipient’s domain.
Attempt SMTP delivery.
Based on the outcome:
- Success: mark as 'success', set delivered_at.
- Permanent failure (5xx): mark as 'permanent_failure', set last_response.
- Temporary failure (4xx) or network error: update attempt_count, compute next retry time using exponential backoff, re‑set status to 'pending'.
- If attempt_count >= max_attempts: mark as 'expired'.

Exponential Backoff with Jitter

Naive backoff doubles the delay each time: 1 min, 2 min, 4 min, … but if many messages were queued at the same time, they will all retry simultaneously after the same delay, causing a thundering herd. Adding jitter (random variation) spreads out the retries.

A common formula (from AWS and others) is:

sleep = min(cap, base * 2^attempt) * (1 + random_between(0, jitter_factor))

For email, typical values: base = 60 seconds, max_delay = 24*3600 seconds (24 hours), and jitter_factor = 0.25. We’ll also add an initial delay before the first retry (say 30 seconds) to handle the case where the server is momentarily overloaded.

import random

def compute_next_retry(attempt_count: int, base: float = 60.0, cap: float = 86400.0, jitter: float = 0.25) -> float:
    delay = min(cap, base * (2 ** (attempt_count - 1)))  # first retry: base*1 = 60s
    delay *= (1 + random.uniform(0, jitter))
    return time.time() + delay

Note: first retry happens after attempt_count becomes 1 (the initial attempt failed). So the first delay is base seconds (with jitter). This matches typical MTAs.

A Smattering of State

We need a function that actually performs the SMTP conversation and returns a delivery result. We’ll define a simple enum:

from enum import Enum

class DeliveryResult(Enum):
    SUCCESS = "success"
    PERMANENT_FAILURE = "permanent_failure"
    TEMPORARY_FAILURE = "temporary_failure"

And a result object:

class DeliveryOutcome:
    def __init__(self, result: DeliveryResult, message: str = ""):
        self.result = result
        self.message = message

Now the delivery attempt function (sketch):

def attempt_delivery(sender: str, recipient: str, message: bytes) -> DeliveryOutcome:
    domain = recipient.split('@')[1]
    try:
        mx_servers = resolve_mx(domain)
    except RetryableError as e:
        return DeliveryOutcome(DeliveryResult.TEMPORARY_FAILURE, str(e))

    for priority, host in mx_servers:
        try:
            with smtplib.SMTP(host, 25, timeout=30) as server:
                server.ehlo()
                # Possibly STARTTLS
                if server.has_extn('STARTTLS'):
                    server.starttls()
                    server.ehlo()
                server.sendmail(sender, [recipient], message)
            return DeliveryOutcome(DeliveryResult.SUCCESS)
        except smtplib.SMTPRecipientsRefused as e:
            # Permanent refusal for this recipient (5xx)
            response = str(e.recipients[recipient][1]) if recipient in e.recipients else str(e)
            return DeliveryOutcome(DeliveryResult.PERMANENT_FAILURE, response)
        except smtplib.SMTPSenderRefused:
            # Usually permanent
            return DeliveryOutcome(DeliveryResult.PERMANENT_FAILURE, "Sender refused")
        except smtplib.SMTPDataError:
            # Could be temporary or permanent depending on code; we default to temporary
            return DeliveryOutcome(DeliveryResult.TEMPORARY_FAILURE, "Data error")
        except (TimeoutError, ConnectionRefusedError, smtplib.SMTPServerDisconnected) as e:
            # Network / transport – temporary, try next MX
            continue
        except smtplib.SMTPException as e:
            return DeliveryOutcome(DeliveryResult.TEMPORARY_FAILURE, str(e))
    # All servers failed with temporary errors
    return DeliveryOutcome(DeliveryResult.TEMPORARY_FAILURE, "All MX servers failed")

Now the worker loop:

def worker_loop():
    conn = sqlite3.connect("queue.db")
    conn.execute("PRAGMA journal_mode=WAL")
    while True:
        row = claim_next_message(conn)
        if not row:
            time.sleep(1)  # idle polling
            continue
        outcome = attempt_delivery(row['sender'], row['recipient'], row['message'])
        if outcome.result == DeliveryResult.SUCCESS:
            conn.execute("UPDATE queue SET status='success', delivered_at=?, last_response=? WHERE id="id2d3df88655d4">,
                         (time.time(), outcome.message, row['id']))
        elif outcome.result == DeliveryResult.PERMANENT_FAILURE:
            conn.execute("UPDATE queue SET status='permanent_failure', last_response=? WHERE id="id2d3df88655d4">,
                         (outcome.message, row['id']))
        else: # temporary failure
            new_attempt_count = row['attempt_count'] + 1
            if new_attempt_count >= row['max_attempts']:
                conn.execute("UPDATE queue SET status='expired', last_response=? WHERE id="id2d3df88655d4">,
                             (outcome.message, row['id']))
            else:
                next_time = compute_next_retry(new_attempt_count)
                conn.execute("""
                    UPDATE queue SET attempt_count=?, last_response=?, next_attempt=?, status='pending'
                    WHERE id="i0b395a8"
                """, (new_attempt_count, outcome.message, next_time, row['id']))
        conn.commit()

We’ve omitted concurrency protection (e.g., row locking) for clarity. In production, you’d use SELECT ... FOR UPDATE or rely on the claim updating status atomically. Also, real workers would handle graceful shutdown and signal handling.

Part 5: Handling Edge Cases and Production Concerns

Greylisting

Many mail servers use greylisting: they temporarily reject an email from an unknown sender with a “try again later” (4xx). The MTA must retry after a delay (usually 60–600 seconds). Our exponential backoff naturally handles this.

Rate Limiting

If you send too many emails to the same remote server in a short period, it may throttle you. To avoid getting blacklisted, you should rate‑limit per domain. A simple approach: maintain a sliding window count of successfully established connections per domain, and delay the next connection if above a threshold. Or use a token bucket.

Implementation:

from collections import defaultdict
import time

class RateLimiter:
    def __init__(self, max_per_minute: int = 10):
        self.max_per_minute = max_per_minute
        self.windows = defaultdict(list)  # domain -> list of timestamps

    def allowed(self, domain: str) -> bool:
        now = time.time()
        # Remove timestamps older than 60 seconds
        self.windows[domain] = [t for t in self.windows[domain] if now - t < 60]
        if len(self.windows[domain]) < self.max_per_minute:
            self.windows[domain].append(now)
            return True
        return False

Then in attempt_delivery, before connecting, check if we are allowed:

if not rate_limiter.allowed(domain):
    return DeliveryOutcome(DeliveryResult.TEMPORARY_FAILURE, "Rate limited")

Persistent Failures: Bounce Handling

When a message receives a permanent failure (e.g., recipient does not exist), the MTA should generate a bounce email (Delivery Status Notification – DSN) back to the sender. Bounce messages are themselves emails that must be enqueued (with special care to avoid bounce loops). This is a large topic; for now we’ll just mark the message as permanently failed.

Concurrency

Running multiple workers can cause duplicate deliveries if they both claim the same message. Our UPDATE ... RETURNING mitigates this, but if two workers run the same update concurrently, only one will get the row (the other gets None). However, if the first worker fails after claiming and before updating, the message may be stuck in 'delivering' state. A watchdog process should periodically “unstick” messages that have been in 'delivering' for too long (e.g., > 5 minutes) by resetting them to 'pending' and incrementing attempt count.

Logging and Monitoring

Every delivery attempt should be logged: timestamp, sender, recipient, server, SMTP response, outcome. For a production MTA, use structured logging (JSON) and ship to a central log aggregator. Also expose metrics via Prometheus: total enqueued, delivered, bounced, expired, current queue depth, delivery latency.

Security Considerations

STARTTLS: Always attempt to upgrade to TLS. Verify server certificate? In an MTA‑to‑MTA context, certificates are optional (opportunistic TLS), but you should still support it.
Limit input size: Reject messages above a configurable size (e.g., 25MB).
Authentication: If your MTA also acts as MSA, require SASL authentication.
Rate limit per sender: Protect your system from abuse.

Part 6: Advanced Queue Design – Priority and Deferred Delivery

Our current queue processes messages in FIFO order by next_attempt. But sometimes you need priorities: delivery receipts (DSN) should be expedited; bulk newsletters can wait. Add a priority column (0=high, 1=normal, 2=low) and order by priority ASC, next_attempt ASC.

Also, some MTAs support scheduled delivery (e.g., “send at 9am my time”). Store an enqueue_time and a scheduled_time; only process when scheduled_time is past.

Stageless Queue Processing with Event‑Driven Triggers

Instead of polling every second, we can use a timer schedule. For each message, compute the next attempt time and insert it into a min‑heap. A scheduler thread sleeps until the top of the heap becomes due. When the due time arrives, it wakes up and processes. This reduces unnecessary polling.

import heapq
import threading

class TimedQueue:
    def __init__(self, db_conn):
        self.heap = []
        self.lock = threading.Lock()
        self.db = db_conn

    def load_pending(self):
        # load all pending items sorted by next_attempt
        rows = self.db.execute("SELECT id, next_attempt FROM queue WHERE status='pending'")
        with self.lock:
            for row in rows:
                heapq.heappush(self.heap, (row[1], row[0]))

    def wait_and_process(self):
        while True:
            with self.lock:
                if self.heap:
                    next_time, msg_id = self.heap[0]
                else:
                    next_time = None
            if next_time is None:
                time.sleep(1)
                continue
            now = time.time()
            if next_time <= now:
                # pop and process
                heapq.heappop(self.heap)
                process_message(msg_id)
            else:
                time.sleep(min(next_time - now, 30))  # wake up at most 30s early

This approach reduces overhead, but you must also handle new enqueues: when a message is enqueued, add it to the heap and notify the scheduler (e.g., via a threading.Event).

Part 7: Testing Your MTA

Testing an MTA against real servers is tricky. You can use a local mock SMTP server that simulates various scenarios. Python’s aiosmtpd is perfect for this.

from aiosmtpd.controller import Controller
import asyncio

class TestHandler:
    async def handle_DATA(self, server, session, envelope):
        # simulate permanent failure for one recipient
        if 'fail@test.com' in envelope.rcpt_tos:
            return '550 No such user'
        return '250 OK'

controller = Controller(TestHandler(), hostname='localhost', port=10025)
controller.start()

Then point your MTA to resolve MX for test.com to localhost:10025 by modifying DNS (e.g., use /etc/hosts or a local DNS server). Write unit tests for queue operations, retry timing, and bounce generation.

Integration Tests

Send an email known to be greylisted: verify it’s retried after >2 minutes.
Drop connection after DATA: verify the MTA retries later (and does not consider it delivered).
Send to a domain with multiple MX servers: kill the primary, verify the MTA falls back to the backup.
Run two workers concurrently: verify each message is delivered exactly once.

Part 8: Scaling and Production Deployment

Database Backends

SQLite works for up to a few hundred messages per minute. Beyond that, move to PostgreSQL with row‑level locking, or use a dedicated queue like RabbitMQ with persistent queues. The concepts remain the same.

Horizontal Scaling

Multiple MTA instances can share the same database (if using PostgreSQL) with careful transaction isolation. Each instance runs its own workers. Use a distributed lock or let the database handle concurrency.

Monitoring and Alerting

Queue depth per status (pending, delivering, expired).
Delivery latency (time from enqueue to delivered).
Error rate per response code.
Retry distribution (how many attempts before success/failure).
Worker health (uptime, claimed messages per second).

Set up alerts for queue depth growing, high expiration rate, or a sudden spike in temporary failures.

Conclusion

Building an MTA from scratch is an excellent exercise in distributed systems thinking. You start with a simple protocol – a few ASCII commands over TCP – and gradually layer on resilience mechanisms: persistent queues, exponential backoff, jitter, rate limiting, and graceful error handling. The result is a system that can survive network partitions, server crashes, and adversarial remote servers, all while delivering billions of messages reliably.

The code we’ve written is a skeleton, but it captures the core logic used by production MTAs like Postfix. Expanding it to a full‑featured system would add:

Message parsing and header manipulation.
Bounce message generation.
DSN support.
Connection reuse and pipelining.
DomainKeys Identified Mail (DKIM) signing.
Sender Policy Framework (SPF) checks.

But even as a sketch, this implementation reveals why email works. It’s not magic. It’s the product of decades of carefully designed retry semantics, queue management, and fault tolerance. Every time you hit send, you’re invoking a distributed system that has been battle‑tested for over 40 years.

Now you have the mental model – and the code – to contribute to that legacy. Fire up a Python REPL, enqueue a message, and watch your own MTA retry its way to the inbox. The invisible miracle becomes visible, and you’ll never look at email the same way again.

If you want to explore the full source code of this project, including tests and a working CLI, check out the companion repository at github.com/example/mta-tutorial (though actually it’s just for illustration).