Automating Email Triage and Responses with Python and OpenAI

Forty-seven unread emails by 9 a.m. Twelve are urgent. Eight need a response you've written a hundred times. The rest are noise you'll archive without reading — if you ever get to them.

This tutorial builds an email triage pipeline: classify incoming emails by urgency and category, generate context-aware draft responses for the routine ones, and flag the rest for human review. You'll need Python 3.10+, an OpenAI API key, and openai installed.

pip install openai python-dotenv

No Gmail integration here — that's a separate 3,000-word article. We'll work with email objects as Python dicts, which means you can drop this classifier into any pipeline that already hands you a subject, sender_name, and body. IMAP, Gmail API, webhook from your mail server — the logic is the same regardless.

The Classification Step

Before you draft anything, you need to know what kind of email you're looking at. A refund request, a partnership inquiry, and a GDPR deletion request all need different handling — and different tones.

Send the subject and body to GPT-4o and ask for a structured classification. Use response_format={"type": "json_object"} to keep the output consistent.

import os
import json
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI()  # reads OPENAI_API_KEY from env

def classify_email(subject: str, body: str) -> dict:
    prompt = f"""Classify this email. Return a JSON object with exactly these fields:
- category: one of "support", "billing", "partnership", "legal", "spam", "internal", "other"
- urgency: one of "high", "medium", "low"
- needs_human: true if this requires a human response, false if a draft is acceptable
- confidence: float between 0 and 1
- summary: one sentence describing what the sender wants

Email subject: {subject}
Email body:
{body}

Return only valid JSON. No explanation."""

    response = client.chat.completions.create(
        model="gpt-4o",  # or gpt-4o-mini for cost-sensitive pipelines
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )

    return json.loads(response.choices[0].message.content)

Temperature zero matters here. Classification isn't creative — you want the same answer every time for the same input. I've watched pipelines miscategorize legal emails as "other" when temperature was left at default, which put someone in a conversation they weren't prepared for.

Generating Draft Responses

Classification tells you what you're looking at. Draft generation is what you do when needs_human is false.

The generator takes the classification output and the original email and produces a response that sounds like it came from someone who actually read the message.

def generate_draft(
    email: dict,
    classification: dict,
    sender_name: str = "",
) -> str:
    category_context = {
        "support": "The sender needs help with a product or service issue.",
        "billing": "The sender has a question or concern about a payment or invoice.",
        "partnership": "The sender is proposing a collaboration or business relationship.",
        "internal": "This is from a colleague or team member.",
        "other": "Handle this as a general inquiry.",
    }

    context = category_context.get(classification["category"], "Handle as a general inquiry.")
    greeting = f"Hi {sender_name}," if sender_name else "Hi,"

    prompt = f"""You are drafting a professional email response.

Context: {context}
Urgency: {classification['urgency']}
What the sender wants: {classification['summary']}

Original email subject: {email['subject']}
Original email body:
{email['body']}

Write a response that:
- Opens with "{greeting}"
- Is concise — under 150 words
- Addresses what the sender actually asked
- Sounds human, not templated
- Does not use "I hope this email finds you well" or "Please do not hesitate"
- Ends with a next step or a specific question, not a generic closing

Write only the email body. No subject line."""

    response = client.chat.completions.create(
        model="gpt-4o",  # or gpt-4o-mini to reduce cost on high-volume inboxes
        messages=[{"role": "user", "content": prompt}],
        temperature=0.4,
    )

    return response.choices[0].message.content.strip()

Temperature is higher here — 0.4 — because drafts should vary. Running the same support email through twice and getting identical phrasing is an easy tell. A bit of variability makes the output feel less mechanical.

When Does Auto-Draft Break?

More often than you'd like.

The classifier struggles with emails that deliberately obscure intent. An angry customer keeping a polite tone scores low urgency when the real risk is high. A refund request phrased as a "general inquiry" ends up miscategorized. Legal emails with casual subject lines slip through to "other" if the body doesn't contain the right signals.

Two patterns help. First: a keyword override layer. Run a regex pass before the LLM call. If the subject or body contains "legal action", "attorney", "lawsuit", "GDPR Article 17", or "chargeback", force needs_human: true and urgency: high regardless of what the model returns. LLMs shouldn't be the last line of defense on legal exposure.

Second: respect the confidence score. The classifier returns a confidence field. Route anything below 0.75 to a human review queue automatically. The model is usually honest about uncertainty — a score of 0.4 is genuinely a borderline case, not the model being cautious.

I don't know where the right threshold sits for every inbox. In a low-volume context (under 200 emails/day) I'd start at 0.75. In a high-volume support queue where routing errors cost agent time, I'd push to 0.8 and tune down from there. It depends on how much you trust the model versus how much your team can absorb the overflow.

Build the Full Pipeline

from dataclasses import dataclass
from typing import Optional

@dataclass
class TriageResult:
    email_id: str
    classification: dict
    draft: Optional[str]
    requires_human: bool

LEGAL_KEYWORDS = {
    "legal action", "attorney", "lawsuit", "chargeback",
    "gdpr article 17", "right to erasure", "arbitration",
}

CONFIDENCE_THRESHOLD = 0.75

def triage_email(email: dict) -> TriageResult:
    subject = email.get("subject", "")
    body = email.get("body", "")
    sender = email.get("sender_name", "")

    # Keyword override — deterministic, no API call needed
    combined_text = f"{subject} {body}".lower()
    if any(kw in combined_text for kw in LEGAL_KEYWORDS):
        classification = {
            "category": "legal",
            "urgency": "high",
            "needs_human": True,
            "confidence": 1.0,
            "summary": "Legal keyword detected — flagged for human review.",
        }
        return TriageResult(
            email_id=email["id"],
            classification=classification,
            draft=None,
            requires_human=True,
        )

    classification = classify_email(subject, body)

    low_confidence = classification.get("confidence", 0) < CONFIDENCE_THRESHOLD
    needs_human = classification.get("needs_human", True) or low_confidence

    draft = None
    if not needs_human and classification.get("category") != "spam":
        draft = generate_draft(email, classification, sender_name=sender)

    return TriageResult(
        email_id=email["id"],
        classification=classification,
        draft=draft,
        requires_human=needs_human,
    )


if __name__ == "__main__":
    test_emails = [
        {
            "id": "email_001",
            "subject": "Invoice #4421 — still hasn't been processed",
            "body": "Hi, I submitted invoice #4421 three weeks ago and haven't received payment. Can you confirm the status? This is the third time I'm following up.",
            "sender_name": "Marcus",
        },
        {
            "id": "email_002",
            "subject": "Partnership opportunity",
            "body": "We're a growth agency working with SaaS companies in your space. Would love to explore whether there's a fit for a referral arrangement.",
            "sender_name": "Priya",
        },
    ]

    for email in test_emails:
        result = triage_email(email)
        print(f"\n--- {email['id']} ---")
        print(f"Category:       {result.classification['category']}")
        print(f"Urgency:        {result.classification['urgency']}")
        print(f"Confidence:     {result.classification.get('confidence', 'N/A')}")
        print(f"Requires human: {result.requires_human}")
        if result.draft:
            print(f"\nDraft:\n{result.draft}")

The keyword check runs before the API call — cheap, deterministic, and the safest layer in the pipeline.

What the Production Version Needs

A few things this tutorial deliberately skips:

Thread context. A standalone email is easy. A thread with five back-and-forth messages is where it gets fiddly — the draft needs to account for what's already been said. Pass the previous 2–3 messages as context to the draft generator, oldest to newest, and tell the model it's generating a reply within a thread rather than responding to a cold email.

Draft staging, not auto-send. Route drafts to a review folder first. Even a 95% accurate classifier gets 1 in 20 emails wrong, and the wrong draft sent in your name is its own kind of incident. Most teams I've seen review drafts for a few weeks before enabling any automation at the send layer.

Logging. Track classifications, confidence scores, and human overrides. After 500 emails you'll have enough data to see which categories the model consistently misses and whether your keyword overrides are firing too broadly.

Rate limiting. Processing a backlog of 500 emails runs into per-minute token limits. Add a short sleep between calls or route them through a job queue if you're hitting the ceiling.

How Much Should You Trust the Classifier?

The confidence threshold is the part I keep second-guessing. Set it too low and you're routing genuinely ambiguous emails to humans who then resent being handed garbage. Set it too high and the pipeline becomes a review queue that happens to have classification labels attached.

The 0.75 number above is a starting point, not a recommendation. I've seen it hold fine in a 150-email-per-day support queue with clean categories and steady volume. I've also seen it flood a review queue where customers routinely sent three-sentence emails with no clear ask — the model returned 0.65 confidence on basically everything, not because it was uncertain about the category, but because the emails themselves were genuinely unclear.

What I don't have is a rule for distinguishing those two situations before you've run the pipeline for a week. The first 200 classifications tell you more about calibration than any threshold heuristic you can set in advance — and until then, you're mostly guessing.