Shamim Shams

Creating an AI Code Reviewer That Comments on GitHub Pull Requests


Your team ships fast. PRs stack up on Friday and get reviewed by Tuesday, if you're lucky. By then the author has context-switched twice and the reviewer is catching up from memory.

An AI reviewer won't replace a senior engineer. It won't catch the bug that requires knowing how a database migration two years ago left a nullable column in the wrong state. What it catches: raw SQL built with string formatting, bare except blocks that swallow errors silently, missing auth decorators, a list variable shadowing the built-in. That's the class of issue that slips through when someone's reading twenty files at once and getting tired.

This tutorial builds a Python script that fetches a PR diff from GitHub's API, sends it to GPT-4o for analysis, and posts structured feedback as a GitHub review — automatically, within seconds of a PR opening. You'll test the script locally first, then wire it to a GitHub Actions workflow that triggers on every new or updated PR.

How the Pipeline Works

The GitHub API provides the diff and accepts the review when it's done. A Python script in the middle fetches the diff, sends it to GPT-4o, formats the response, and posts it back. GitHub Actions triggers the script whenever a PR opens or receives new commits.

Three moving pieces, each isolated enough to test separately. The script runs in 10–15 seconds on a typical PR.

Setting Up

pip install openai requests python-dotenv

Create a .env for local testing:

OPENAI_API_KEY=sk-proj-your-key-here
GITHUB_TOKEN=ghp_your-pat-here
GITHUB_REPOSITORY=owner/repo

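For local testing, GITHUB_TOKEN is a personal access token: a classic token needs the repo scope, and a fine-grained token needs read and write access to pull requests. In GitHub Actions the runner provides its own token automatically, so you won't create that secret yourself.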
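
A missing variable otherwise surfaces as a KeyError or a 401 halfway through a run, so a small pre-flight check can save a confusing first attempt. This is a minimal sketch: the helper name missing_settings is hypothetical, and the variable list simply mirrors the .env file above.

```python
import os
from typing import Mapping

# Names taken from the .env example above; adjust if yours differ.
REQUIRED_SETTINGS = ("OPENAI_API_KEY", "GITHUB_TOKEN", "GITHUB_REPOSITORY")


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]
```

Call it after load_dotenv() and exit early if the returned list is not empty. Passing a plain dict instead of os.environ makes the check easy to unit test.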

Fetching the Diff

GitHub serves PR diffs directly if you set the right Accept header. The PR metadata (needed for the head commit SHA) comes from the same URL with a different Accept header, so the function makes two requests:

# review_pr.py
import os
import sys
import json
import requests
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI()
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]


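def get_pr_diff(repo: str, pr_number: int) -> tuple[str, str]:
    """Return (unified diff text, head commit SHA) for a pull request."""
    base_url = f"https://api.github.com/repos/{repo}/pulls/{pr_number}"
    auth = {"Authorization": f"Bearer {GITHUB_TOKEN}"}

    # JSON media type: PR metadata, including the head commit SHA.
    pr_info = requests.get(
        base_url,
        headers={**auth, "Accept": "application/vnd.github.v3+json"},
        timeout=30,  # don't let a stalled connection hang the CI job
    )
    pr_info.raise_for_status()
    head_sha = pr_info.json()["head"]["sha"]

    # Same URL, diff media type: the raw unified diff as plain text.
    diff_response = requests.get(
        base_url,
        headers={**auth, "Accept": "application/vnd.github.v3.diff"},
        timeout=30,
    )
    diff_response.raise_for_status()
    return diff_response.text, head_sha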

You need the head_sha when submitting a review — GitHub ties the review to a specific commit so it remains accurate even as the branch updates.

The Review Prompt

JSON mode keeps the response parseable without regex gymnastics. temperature=0 makes the output largely consistent across repeated runs on the same diff. It isn't guaranteed to be identical every time, but it's stable enough to be useful when you're testing the script against a known PR.

def review_diff(diff: str) -> dict:
    max_chars = 12_000
    if len(diff) > max_chars:
        diff = diff[:max_chars] + "\n\n[diff truncated — remaining files not reviewed]"

    prompt = f"""You are a senior software engineer reviewing a pull request.

Analyze this diff and look for:
- Bugs and logic errors
- Security issues (SQL injection, exposed secrets, missing auth, unvalidated input)
- Swallowed exceptions or missing error handling
- Performance problems (N+1 queries, unbounded loops)
- Code that would confuse a new maintainer

Return only JSON in this structure:
{{
  "summary": "One paragraph overall assessment.",
  "issues": [
    {{
      "severity": "critical | warning | suggestion",
      "file": "path/to/file.ext",
      "description": "What the problem is and how to fix it."
    }}
  ],
  "approved": true | false
}}

Set approved to false if any critical issues exist.

Diff:
{diff}"""

    response = client.chat.completions.create(
        model="gpt-4o",  # or gpt-4o-mini for lower cost
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

The 12,000-character truncation is conservative. GPT-4o's context window is much larger, but longer diffs produce vaguer, less targeted feedback. A focused review of the first 200 changed lines is more useful than a vague overview of 1,000.
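
One weakness of slicing by character count is that the cut can land in the middle of a hunk, leaving the model with half a change. A possible refinement, sketched below, is to keep whole files until the budget runs out. The function names split_diff_by_file and truncate_by_file are hypothetical, and the sketch assumes the diff uses standard "diff --git" file headers, as GitHub's diff output does.

```python
def split_diff_by_file(diff: str) -> list[str]:
    """Split a unified diff into one chunk per file, keyed on 'diff --git' headers."""
    chunks: list[str] = []
    current: list[str] = []
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git ") and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


def truncate_by_file(diff: str, max_chars: int = 12_000) -> tuple[str, int]:
    """Keep whole file chunks until the budget is spent.

    Returns the shortened diff and the number of files left out.
    """
    chunks = split_diff_by_file(diff)
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        if kept and used + len(chunk) > max_chars:
            break
        # The first file is always kept, cut to the budget if it alone exceeds it.
        kept.append(chunk[:max_chars])
        used += len(kept[-1])
    skipped = len(chunks) - len(kept)
    text = "".join(kept)
    if skipped:
        text += f"\n[diff truncated: {skipped} file(s) not reviewed]\n"
    return text, skipped
```

The skipped count is worth surfacing in the review body so nobody assumes the bot read files it never saw.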

Posting the Review

GitHub's review API distinguishes three event types: APPROVE, REQUEST_CHANGES, and COMMENT. The first two affect merge eligibility. Since this is automated, REQUEST_CHANGES only fires when the model finds critical issues — anything else goes in as COMMENT so developers don't spend their day dismissing bot blocks.

def format_review_body(review: dict) -> str:
    severity_labels = {
        "critical": "CRITICAL",
        "warning": "WARNING",
        "suggestion": "SUGGESTION",
    }
    lines = ["## AI Code Review\n", review["summary"], ""]

    if review["issues"]:
        lines.append("### Issues\n")
        for issue in review["issues"]:
            label = severity_labels.get(issue["severity"], issue["severity"].upper())
            lines.append(f"**[{label}] `{issue['file']}`**")
            lines.append(f"{issue['description']}\n")
    else:
        lines.append("No significant issues found.")

    status = "Approved" if review.get("approved") else "Changes requested"
    lines.append(f"\n**Status:** {status}")
    return "\n".join(lines)


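def post_review(repo: str, pr_number: int, head_sha: str, review: dict) -> None:
    has_critical = any(i["severity"] == "critical" for i in review.get("issues", []))

    # Never send APPROVE: a bot approval can satisfy branch protection, and the
    # Actions token is refused (422) unless the repo allows Actions to approve PRs.
    # Critical findings block the PR; everything else is an ordinary comment.
    event = "REQUEST_CHANGES" if has_critical else "COMMENT"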

    payload = {
        "commit_id": head_sha,
        "body": format_review_body(review),
        "event": event,
    }
    url = f"https://api.github.com/repos/{repo}/pulls/{pr_number}/reviews"
    headers = {
        "Authorization": f"Bearer {GITHUB_TOKEN}",
        "Accept": "application/vnd.github.v3+json",
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    print(f"Review posted: {response.json()['html_url']}")
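
Both functions above index straight into the model's output (review["summary"], issue["severity"], issue["file"]), so one missing key turns into a KeyError instead of a posted review. JSON mode guarantees valid JSON, not this particular schema. A minimal sketch of a normalizing step follows; the name normalize_review and the choice to downgrade unknown severities to "suggestion" are assumptions, not part of the original script.

```python
ALLOWED_SEVERITIES = {"critical", "warning", "suggestion"}


def normalize_review(raw: dict) -> dict:
    """Coerce the model's JSON into the shape the formatting code expects."""
    raw_issues = raw.get("issues")
    issues = []
    for item in raw_issues if isinstance(raw_issues, list) else []:
        if not isinstance(item, dict):
            continue  # skip anything that is not an issue object
        severity = str(item.get("severity", "")).strip().lower()
        if severity not in ALLOWED_SEVERITIES:
            severity = "suggestion"  # assumption: unknown labels are downgraded
        issues.append({
            "severity": severity,
            "file": str(item.get("file") or "unknown file"),
            "description": str(item.get("description") or "").strip(),
        })
    has_critical = any(i["severity"] == "critical" for i in issues)
    return {
        "summary": str(raw.get("summary") or "No summary returned."),
        "issues": issues,
        # Enforce the prompt's rule locally instead of trusting the model.
        "approved": bool(raw.get("approved")) and not has_critical,
    }
```

Running the parsed response through this before formatting means a sloppy answer degrades the review rather than crashing the job.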

Putting It Together

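def main() -> None:
    # Command-line arguments win over the environment, so a local run can
    # target any repo even when .env already sets GITHUB_REPOSITORY.
    args = sys.argv[1:]
    repo = args[0] if len(args) >= 1 else os.environ["GITHUB_REPOSITORY"]
    pr_number = int(args[1] if len(args) >= 2 else os.environ["PR_NUMBER"])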

    print(f"Reviewing PR #{pr_number} in {repo}...")
    diff, head_sha = get_pr_diff(repo, pr_number)
    review = review_diff(diff)
    post_review(repo, pr_number, head_sha, review)


if __name__ == "__main__":
    main()

Test it against a real PR before touching GitHub Actions:

python review_pr.py owner/your-repo 42

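If the script exits cleanly and prints a URL, you're ready. Open the PR and check the Reviews tab. One caveat for local runs: GitHub rejects approving or requesting changes on your own pull request with a 422, so test against a PR opened by a different account than the one that owns the token.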

Wiring It to GitHub Actions

Create .github/workflows/ai-review.yml:

name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
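    permissions:
      contents: read        # checkout needs this once permissions are listed explicitly
      pull-requests: write  # required to post the review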

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install openai requests python-dotenv

      - name: Run AI review
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_REPOSITORY: ${{ github.repository }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: python review_pr.py

Add OPENAI_API_KEY to your repository secrets under Settings → Secrets and variables → Actions. GITHUB_TOKEN is injected automatically by the runner.

The permissions: pull-requests: write block is not optional. Without it the POST returns 403 with an unhelpful message ("Resource not accessible by integration"), which is worth knowing before you spend twenty minutes checking your API key.

Does It Actually Catch Real Bugs?

Sometimes. What it catches depends on what you're throwing at it.

It's reliable for: SQL built with f-strings or % formatting, bare except blocks that silently swallow everything, hardcoded tokens accidentally committed, functions that return inconsistent types depending on a branch, off-by-one errors that are obvious on inspection. Things that should fail linting but don't.

One real example: I ran this against a Django codebase's PR that added a new view. The model caught a missing @login_required decorator — a security issue that's easy to miss when you're skimming fifteen files. It also flagged a variable named list that shadowed the built-in. Both were valid. It also flagged a three-line helper as "complex and worth extracting." That one wasn't.

It misses semantic bugs entirely. If a function is logically wrong — produces bad output under conditions that require domain knowledge to recognize — the model won't know. It also struggles when the diff mixes boilerplate (generated migrations, config files) with the actual logic change; the signal degrades when the model is reading noise.
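
One way to cut that noise is to drop generated files from the diff before it reaches the model. The sketch below assumes standard "diff --git" headers; the function name drop_noise and the glob patterns are hypothetical examples, so swap in whatever counts as generated in your repo.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical patterns for generated or vendored files.
NOISE_PATTERNS = ("*/migrations/*.py", "*.lock", "*package-lock.json", "*.min.js")

_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def drop_noise(diff: str, patterns: tuple[str, ...] = NOISE_PATTERNS) -> str:
    """Remove whole-file sections whose path matches any noise pattern."""
    kept: list[str] = []
    keep_current = True
    for line in diff.splitlines(keepends=True):
        match = _HEADER.match(line.rstrip("\n"))
        if match:
            # A new file section starts here; decide once for all its lines.
            path = match.group(2)
            keep_current = not any(fnmatchcase(path, p) for p in patterns)
        if keep_current:
            kept.append(line)
    return "".join(kept)
```

Filtering before truncation also means the character budget goes to code a human actually wrote.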

Extending to Inline Comments

The current setup posts one review comment at the PR level. GitHub's API also supports inline comments on specific diff lines — more precise, and what you're probably used to from human reviewers.

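Inline comments have traditionally needed a position value: the line's offset within that file's section of the unified diff, not its line number in the file. You have to parse the @@ -x,y +a,b @@ headers, count lines forward from the first one, and map the model's output (file + file-line) to a diff position. It's about 40 lines of parsing code and it's fiddly; GitHub's definition of "position" has tripped up more than a few reviewer bots. The reviews endpoint also accepts line and side fields on each comment, which address a line of the file directly and may let you skip the position arithmetic.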

For most teams, the top-level review is enough to start. If you want inline comments later, extend review_diff() to return a line number per issue, write a diff_position_for(path, line) parser against the raw diff text, and pass an inline comments array to the review payload.

What You're Actually Signing Up For

The pipeline runs. What tutorials like this one usually leave out is what happens two months in.

The first problem is noise. GPT-4o has opinions about style, and it will surface them. A developer who's been writing Python for ten years doesn't love seeing "consider extracting this into a helper function" on a perfectly readable five-line block. The first time, they'll engage with the review. The third time, they'll start dismissing it on reflex. You'll need to tune the prompt — probably several iterations — before the signal-to-noise ratio reaches a place where the team actually trusts it.
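
Prompt tuning is the real fix, but a blunt stopgap is to filter by severity before posting. The sketch below assumes the issue schema from the review prompt; the name filter_by_severity and the default floor of "warning" are hypothetical choices.

```python
SEVERITY_RANK = {"suggestion": 0, "warning": 1, "critical": 2}


def filter_by_severity(review: dict, minimum: str = "warning") -> dict:
    """Return a copy of the review without issues ranked below `minimum`."""
    floor = SEVERITY_RANK[minimum]
    kept = [
        issue
        for issue in review.get("issues", [])
        # Unknown severity labels are treated as the lowest rank.
        if SEVERITY_RANK.get(issue.get("severity"), 0) >= floor
    ]
    return {**review, "issues": kept}
```

Starting at "warning" and loosening later is an easier sell to the team than starting noisy and asking for patience.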

The second: the model occasionally hallucinates a problem. I've seen it describe a bug in a line that doesn't exist in the diff. These are easy to spot and dismiss, but they erode trust faster than any stylistic false positive. Keeping diffs focused and the prompt specific reduces this; it doesn't eliminate it.
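
A cheap guard catches one slice of these: issues that name a file the diff doesn't contain. The sketch below assumes standard "diff --git" headers and the issue schema from the prompt; files_in_diff and drop_unknown_files are hypothetical names. It can't detect an invented problem inside a file that really is in the diff.

```python
import re

_FILE_HEADER = re.compile(r"^diff --git a/.+? b/(.+)$", re.MULTILINE)


def files_in_diff(diff: str) -> set[str]:
    """Collect the new-side path of every file section in a unified diff."""
    return set(_FILE_HEADER.findall(diff))


def drop_unknown_files(review: dict, diff: str) -> tuple[dict, list[dict]]:
    """Split issues into those naming a file in the diff and those that don't.

    Returns (review with only the kept issues, list of dropped issues).
    """
    known = files_in_diff(diff)
    kept: list[dict] = []
    dropped: list[dict] = []
    for issue in review.get("issues", []):
        (kept if issue.get("file") in known else dropped).append(issue)
    return {**review, "issues": kept}, dropped
```

Logging the dropped issues instead of discarding them silently gives you a running measure of how often the model drifts.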

I still think it's worth running. The catches that matter — missing auth checks, exposed secrets, swallowed exceptions — are exactly what human reviewers miss when they're tired or rushed. That's when the API cost earns its keep.