Shamim Shams

Automate Bug Triage with AI: Classify GitHub Issues Using Python

Come back to a side project after three months away and you'll find it waiting for you: a backlog of 40, 80, sometimes 200+ open issues, none of them labeled, none of them prioritized. Someone reported a crash. Someone else asked how to configure the timeout. A third person opened a feature request that's really a duplicate of an issue from last year. You can't tell which one to touch first without reading every single one.

That's the problem. Not the bugs themselves — the triage debt. No labels means no filters, no filters means everything looks equally urgent, and everything equally urgent means nothing gets fixed.

This script fixes that. It fetches every open issue in a GitHub repo, sends each one to Claude with a structured classification prompt, and applies labels back via the GitHub API — type (bug, feature, question, docs) and priority (critical, high, medium, low). Run it once to catch up, or wire it into a GitHub Actions workflow to classify issues as they're opened.

The Pipeline

Three pieces, each isolated enough to debug independently.

PyGithub fetches open issues. Each issue becomes a prompt — title plus a truncated version of the body — sent to claude-sonnet-4-6. Claude returns structured JSON with a type, a priority, a confidence score, and a one-line reasoning string. PyGithub takes that output and applies labels back to the original issue.

The script takes about 1–2 seconds per issue, depending on body length.

Set Up

pip install anthropic PyGithub python-dotenv

Create a .env file for local testing:

ANTHROPIC_API_KEY=sk-ant-your-key-here
GITHUB_TOKEN=ghp_your-pat-here
GITHUB_REPO=owner/repo-name

The GITHUB_TOKEN needs repo scope. If you're running this in GitHub Actions, the runner provides GITHUB_TOKEN automatically, so you don't need to create that secret yourself. The workflow does need issues: write permission, or the label calls will be rejected.

# triage.py
import os
import json
import time
import argparse
from dotenv import load_dotenv

load_dotenv()

Fetching Issues from the Repo

PyGithub's get_issues() returns both issues and pull requests. Filter PRs out — you don't want to classify those.

from github import Github
from github.Issue import Issue

def get_open_issues(repo_name: str, token: str) -> list[Issue]:
    g = Github(token)
    repo = g.get_repo(repo_name)
    issues = repo.get_issues(state="open")
    return [i for i in issues if not i.pull_request]

The pull_request attribute is None on real issues and populated on PRs. That filter handles it.

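One thing worth knowing: get_issues() is paginated. PyGithub fetches pages lazily as you iterate, but the list comprehension above walks every page and keeps all the matching issues in memory. That's fine for a few hundred issues; on a much larger backlog, iterate over the result directly instead of building a list.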
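For a much larger backlog, a generator avoids building the list at all. This is a minimal sketch, not part of the original script: iter_open_issues is a hypothetical helper name, and it assumes it is handed an iterable such as the result of repo.get_issues(state="open").

```python
from typing import Any, Iterable, Iterator


def iter_open_issues(issues: Iterable[Any]) -> Iterator[Any]:
    """Yield issues one at a time, skipping pull requests.

    `issues` is assumed to be something like repo.get_issues(state="open");
    any iterable of objects with a `pull_request` attribute works.
    """
    for issue in issues:
        # pull_request is None on real issues and populated on PRs
        if not issue.pull_request:
            yield issue
```

The trade-off is that you lose len(), so a "Found N open issues" message would have to count as it goes.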

What Are We Asking Claude to Decide?

The prompt design matters more than anything else in this pipeline. You want Claude to return consistent, parseable output — not a narrative paragraph you have to parse with regex.

Define the schema explicitly in the system prompt:

SYSTEM_PROMPT = """You are a senior software engineer performing GitHub issue triage.

Classify each issue using this exact JSON schema — respond with JSON only, no explanation:

{
  "type": "bug" | "feature" | "question" | "docs" | "other",
  "priority": "critical" | "high" | "medium" | "low",
  "confidence": <float between 0.0 and 1.0>,
  "reasoning": "<one sentence>"
}

Priority definitions:
- critical: service outage, data loss, security vulnerability
- high: feature broken, significant user impact, no workaround
- medium: degraded behavior, inconvenient but functional
- low: cosmetic, minor enhancement, future consideration
"""

Spelling out what "critical" means is the difference between getting consistent results and getting "high" on everything because the model is being conservative.

The user message is the issue itself:

def build_user_message(title: str, body: str) -> str:
    truncated_body = (body or "")[:500]
    return f"Issue title: {title}\n\nIssue body:\n{truncated_body}"

Truncating the body to 500 characters is a judgment call. Most of the signal is in the first few sentences. Sending a 5,000-word reproduction guide doesn't improve classification — it just costs more tokens and occasionally confuses the model with too much detail.
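A hard slice can cut the body mid-word. If that bothers you, here is a small sketch of a gentler cut. truncate_body and the "[truncated]" marker are my own assumptions, not something the script above uses; the 500-character default just mirrors the slice in build_user_message().

```python
from typing import Optional


def truncate_body(body: Optional[str], limit: int = 500) -> str:
    """Cut `body` to at most `limit` characters, then append a marker."""
    text = (body or "").strip()
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back up to the last space so the excerpt doesn't end mid-word
    head, sep, _ = cut.rpartition(" ")
    if sep:
        cut = head
    # The marker tells the model the text was cut off on purpose
    return cut.rstrip() + " [truncated]"
```

Whether the marker helps classification is untested here; it mostly makes it obvious in logs which issues were cut.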

Calling Claude

import anthropic

client = anthropic.Anthropic()

def classify_issue(title: str, body: str) -> dict | None:
    try:
        message = client.messages.create(
            model="claude-sonnet-4-6",  # or claude-opus-4-8 for complex repos
            max_tokens=256,
            system=SYSTEM_PROMPT,
            messages=[
                {"role": "user", "content": build_user_message(title, body)}
            ],
        )
        raw = message.content[0].text.strip()
        return json.loads(raw)
    except json.JSONDecodeError:
        print(f"  [warn] JSON parse failed for: {title[:60]}")
        return None
    except Exception as e:
        print(f"  [error] {e}")
        return None

max_tokens=256 is plenty for a JSON object this size. Setting it low is intentional — it prevents runaway completions if the model ignores the "JSON only" instruction and starts explaining itself.
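json.loads() only proves the reply is JSON, not that it follows the schema. A sketch of a stricter parser, assuming the schema from the system prompt above; parse_classification, VALID_TYPES and VALID_PRIORITIES are hypothetical names introduced for illustration.

```python
import json
from typing import Optional

VALID_TYPES = {"bug", "feature", "question", "docs", "other"}
VALID_PRIORITIES = {"critical", "high", "medium", "low"}
FENCE = "`" * 3  # Markdown code-fence marker


def parse_classification(raw: str) -> Optional[dict]:
    """Parse the model's reply and check it against the prompt's schema.

    Returns None when the reply is unusable, so callers can skip the issue.
    """
    text = raw.strip()
    # Replies sometimes arrive wrapped in a Markdown fence despite instructions
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    issue_type = data.get("type")
    priority = data.get("priority")
    if not isinstance(issue_type, str) or issue_type not in VALID_TYPES:
        return None
    if not isinstance(priority, str) or priority not in VALID_PRIORITIES:
        return None
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    # Clamp to the range the prompt asks for
    data["confidence"] = min(1.0, max(0.0, confidence))
    return data
```

Rejecting unknown values matters because whatever comes back in "type" ends up as a label name on the repo.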

Applying Labels Back to GitHub

Two steps: create the label if it doesn't exist, then set it on the issue.

from github import GithubException
from github.Repository import Repository

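LABEL_COLORS: dict[str, str] = {
    # Type labels: GitHub's default label palette
    "bug": "d73a4a",
    "feature": "a2eeef",
    "question": "d876e3",
    "docs": "0075ca",
    "other": "e4e669",
    # Priority labels: warm family, red down to pale yellow.
    # Keys carry the "priority:" prefix because that is the name
    # apply_labels() passes to ensure_label().
    "priority:critical": "b60205",
    "priority:high": "e4460a",
    "priority:medium": "fbca04",
    "priority:low": "fef2c0",
}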

def ensure_label(repo: Repository, name: str) -> None:
    try:
        repo.create_label(name=name, color=LABEL_COLORS.get(name, "ededed"))
    except GithubException as e:
        if e.status != 422:  # 422 = label already exists
            raise

def apply_labels(issue: Issue, classification: dict, repo: Repository) -> None:
    labels_to_apply = []

    issue_type = classification.get("type")
    priority = classification.get("priority")

    if issue_type:
        ensure_label(repo, issue_type)
        labels_to_apply.append(issue_type)

    if priority:
        label_name = f"priority:{priority}"
        ensure_label(repo, label_name)
        labels_to_apply.append(label_name)

    if labels_to_apply:
        issue.add_to_labels(*labels_to_apply)

Prefixing priority labels as priority:high rather than just high avoids collisions with existing labels in repos that already use high for something else. I've hit that issue on projects that pre-date any labeling convention.
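One cost of ensure_label() as written: it attempts a create_label call for every label on every issue, which means two extra write requests per issue. Below is a minimal sketch of a cache that lists existing labels once and creates only the missing ones. LabelCache is a hypothetical name, not part of PyGithub; it assumes the repo object exposes get_labels() and create_label(), as PyGithub's Repository does.

```python
class LabelCache:
    """Remember which labels exist so each one is created at most once."""

    def __init__(self, repo, colors: dict, default_color: str = "ededed"):
        self.repo = repo
        self.colors = colors
        self.default_color = default_color
        # One listing call up front instead of a create attempt per issue.
        # Names are lower-cased because GitHub treats "Bug" and "bug" as the same label.
        self.known = {label.name.lower() for label in repo.get_labels()}

    def ensure(self, name: str) -> None:
        if name.lower() in self.known:
            return
        color = self.colors.get(name, self.default_color)
        self.repo.create_label(name=name, color=color)
        self.known.add(name.lower())
```

I'd keep the 422 handling around create_label even with the cache, since someone else can create a label between the listing call and yours.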

Run It Against a Real Repo

def main() -> None:
    parser = argparse.ArgumentParser(description="AI-powered GitHub issue triage")
    parser.add_argument("--repo", default=os.getenv("GITHUB_REPO"))
    parser.add_argument("--dry-run", action="store_true", help="Print classifications, don't write labels")
    args = parser.parse_args()

    token = os.getenv("GITHUB_TOKEN")
    if not args.repo or not token:
        raise ValueError("Set GITHUB_REPO and GITHUB_TOKEN in .env or environment")

    # Github is already imported at module level
    g = Github(token)
    repo = g.get_repo(args.repo)
    issues = get_open_issues(args.repo, token)

    print(f"Found {len(issues)} open issues in {args.repo}")

    for issue in issues:
        print(f"\n#{issue.number}: {issue.title[:70]}")
        result = classify_issue(issue.title, issue.body or "")

        if not result:
            print("  [skipped] classification failed")
            continue

        issue_type = result.get("type", "unknown")
        priority = result.get("priority", "unknown")
        confidence = result.get("confidence", 0.0)
        reasoning = result.get("reasoning", "")

        print(f"  type={issue_type}  priority={priority}  confidence={confidence:.2f}")
        print(f"  reasoning: {reasoning}")

        if not args.dry_run:
            apply_labels(issue, result, repo)
            print("  labels applied")

        # Respect GitHub's secondary rate limits on write operations
        time.sleep(0.5)

if __name__ == "__main__":
    main()

Dry-run first. Always.

python triage.py --repo owner/repo --dry-run

Once you're happy with how Claude is classifying, remove the flag:

python triage.py --repo owner/repo

The 0.5-second sleep between issues isn't about the Anthropic rate limit. It's about GitHub's secondary rate limits, which trigger on rapid label writes. GitHub doesn't warn you that you're close; it just starts returning 403 or 429 errors.
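A fixed sleep lowers the odds of hitting the limit but doesn't recover once you do. One option is to retry with a growing delay. This is a generic sketch, not the script's behavior: with_backoff and its parameters are hypothetical, and the caller decides what counts as a rate-limit error.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    is_rate_limited: Callable[[Exception], bool],
    retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying with exponentially growing delays when rate limited."""
    for attempt in range(retries):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, and the final failure
            if not is_rate_limited(exc) or attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("retries must be at least 1")
```

With PyGithub, the check could be a lambda that tests for a GithubException whose status is 403 or 429, wrapped around the apply_labels() call. If the response carries a Retry-After header, honoring it is better than guessing a delay.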

What You're Actually Signing Up For

The prompt makes triage look like a solved problem. In practice, the same issue can read as a bug or a feature depending on how the author wrote it. "The delete button should ask for confirmation before removing records" — is that a missing feature or a bug in UX? Claude will pick one, and the confidence score won't tell you it was a borderline call. When the issue body is four words ("login page is broken"), the classifier is guessing.
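One cheap mitigation is to route thin or shaky classifications to a person instead of labeling them. This is a sketch under loud assumptions: needs_human_review is a hypothetical helper, and the 15-word and 0.7 thresholds are arbitrary starting points, not tuned values.

```python
from typing import Optional


def needs_human_review(
    body: Optional[str],
    confidence: float,
    min_words: int = 15,
    min_confidence: float = 0.7,
) -> bool:
    """Flag classifications that rest on too little text or a low score."""
    word_count = len((body or "").split())
    return word_count < min_words or confidence < min_confidence
```

Because the confidence score is self-reported, body length is probably the more dependable of the two signals; a four-word issue gets flagged no matter how sure the model claims to be.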

Labels drift faster than you'd expect. Run this on 300 issues today, and six months from now someone redefines what "critical" means on your team. Now the old labels are wrong and the only fix is another classification pass. That's not a disaster, but it is ongoing work that tutorials like this one rarely mention.

I'd still use this over manual triage. The alternative is a dev spending two hours sorting through stale issues on a quarterly basis, and that dev is usually not thinking carefully about priority either. The script just makes the imprecision faster and more consistent. Whether "consistent imprecision" is better than "inconsistent imprecision with human context" is the part I'm genuinely not sure about.