Automating Data Extraction from PDFs Using Python and LLMs
PDFs lie to you. They look structured — columns aligned, tables formatted, field labels sitting neatly next to their values — but underneath that visual order is a flat stream of text with no semantic structure at all. Ask a PDF parser what the invoice total is and it'll give you every number on the page.
You'll need Python 3.10+, an OpenAI API key, and openai, pymupdf, and pydantic installed.
Why Rules-Based Parsing Breaks
Every extraction project starts the same way: write a regex, test it on 10 PDFs, celebrate. Then someone uploads a document from a different vendor and the whole thing falls apart.
"Invoice total" shows up as "Total Amount Due", "Grand Total", "Amount Payable", "Net Amount", or just a bold number at the bottom with a dollar sign. Across 50 vendors you're maintaining 50 parsing rules, and every new document format is a bug report waiting to happen.
LLMs don't care about formatting variation. You describe what you want in plain language, they extract it. The same prompt that handles a clean corporate invoice handles a scan from a handwritten receipt — as long as the text is readable.
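For contrast, the rules-based failure described above can be reproduced in a few lines. This is a minimal sketch: the pattern and the sample strings are hypothetical, not taken from a real invoice set. It only shows that a rule tuned to one vendor's label silently misses the others.

```python
import re
from typing import Optional

# Hypothetical rule written against one vendor's wording.
TOTAL_RULE = re.compile(r"Invoice total:?\s*\$?([\d,]+\.\d{2})", re.IGNORECASE)

# Illustrative lines, one per made-up vendor layout.
SAMPLES = [
    "Invoice total: $1,240.00",
    "Total Amount Due  $1,240.00",
    "Grand Total: 1,240.00",
    "Amount Payable $1,240.00",
]


def find_total(line: str) -> Optional[str]:
    """Return the captured amount, or None when the rule does not match."""
    match = TOTAL_RULE.search(line)
    return match.group(1) if match else None


matches = [find_total(line) for line in SAMPLES]
missed = [line for line, hit in zip(SAMPLES, matches) if hit is None]

if __name__ == "__main__":
    print(f"matched {len(SAMPLES) - len(missed)} of {len(SAMPLES)} lines")
    for line in missed:
        print("missed:", line)
```

Each missed line would need its own pattern, which is how the rule count ends up tracking the vendor count.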
Text Extraction First
Before the LLM can do anything, you need readable text. pymupdf is the most reliable option for digital PDFs:
pip install pymupdf openai pydantic
import fitz  # pymupdf
from pathlib import Path

def extract_text(pdf_path: str) -> str:
    # The context manager closes the document even if get_text() raises.
    with fitz.open(pdf_path) as doc:
        pages = [page.get_text() for page in doc]
    return "\n\n".join(pages)
For scanned PDFs, you'll need OCR — pytesseract paired with pdf2image handles that — but for this article I'm focused on digital PDFs where text is already embedded.
fitz.open() holds a file handle. Always call doc.close() or use a context manager. On large batch jobs, leaving documents open exhausts your file descriptor limit faster than you'd expect.
Tell the Model What You Need
Structured extraction works best when you're explicit about the fields. Pydantic makes this concrete:
from pydantic import BaseModel, Field
from typing import Optional

class InvoiceData(BaseModel):
    vendor_name: str = Field(description="Company or individual who issued the invoice")
    invoice_number: str = Field(description="Invoice ID or reference number")
    invoice_date: str = Field(description="Date the invoice was issued, in original format")
    due_date: Optional[str] = Field(default=None, description="Payment due date if present")
    subtotal: Optional[float] = Field(default=None, description="Amount before tax")
    tax_amount: Optional[float] = Field(default=None, description="Tax amount if shown")
    total_amount: float = Field(description="Final total amount due")
    currency: str = Field(default="USD", description="Currency of the amounts")
The description fields aren't just documentation — they're instructions. The model reads them when mapping document content to fields. "amount" without a description will have the model guessing which of the four numbers on the page you want. Anything ambiguous produces inconsistent results across document types.
Structured Output with OpenAI
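You have a 50-page report and you need the vendor, the total, and the due date. The naive approach is to ask the model in plain text and parse the response yourself, which breaks the first time the model phrases its answer differently. Passing the schema as the response format avoids that: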
import openai
from typing import TypeVar, Type

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
T = TypeVar("T", bound=BaseModel)

def extract_structured(text: str, schema: Type[T], context: str = "") -> T:
    system_msg = "You extract structured data from documents. Return only what's present — don't invent data."
    context_line = f"Context: {context}\n" if context else ""
    user_msg = (
        "Extract the following information from this document.\n"
        f"{context_line}"
        "Document text:\n"
        f"{text[:8000]}\n"  # blunt trim to keep the prompt small
        "If a field is not present, use null."
    )
    response = client.beta.chat.completions.parse(
        model="gpt-4o",  # or gpt-4o-mini for simple single-page docs
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg}
        ],
        response_format=schema,
        temperature=0,
    )
    message = response.choices[0].message
    # parsed is None when the model refuses, so fail loudly instead of returning None.
    if message.parsed is None:
        raise ValueError(f"No structured output returned: {message.refusal}")
    return message.parsed
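client.beta.chat.completions.parse is the OpenAI Python SDK's structured output helper. It constrains the response to JSON that matches your Pydantic schema and returns it as a parsed model instance. That guarantees the shape, not the values: a required field that isn't in the document can still come back filled with a guess, which is why the system message tells the model not to invent data. If the model refuses the request, message.parsed is None, so check for that before using the result.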
Temperature zero matters here.
Extraction isn't creative work. Even small temperature values introduce unnecessary variation in how the model maps fields across otherwise identical documents.
Run It Against a Directory
You have 200 invoices from 12 different vendors. None of them use the same layout.
import json
from pathlib import Path

def process_invoice(pdf_path: str) -> dict:
    text = extract_text(pdf_path)
    if not text.strip():
        raise ValueError(f"No readable text extracted from {pdf_path}")
    result = extract_structured(
        text=text,
        schema=InvoiceData,
        context="This is an invoice or billing document"
    )
    return {
        "file": Path(pdf_path).name,
        "data": result.model_dump(),
        "status": "success"
    }

def batch_process(pdf_dir: str) -> list[dict]:
    results = []
    for pdf_path in Path(pdf_dir).glob("*.pdf"):
        try:
            results.append(process_invoice(str(pdf_path)))
        except Exception as e:
            results.append({
                "file": pdf_path.name,
                "data": None,
                "status": "error",
                "error": str(e)
            })
    return results

if __name__ == "__main__":
    results = batch_process("./invoices")
    print(json.dumps(results, indent=2))
Error handling is deliberately broad — Exception catches both API failures and document errors. In production you'd separate those: retry on transient API errors, log and skip on document errors.
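One way to make that separation is a small retry wrapper that retries only the exception types you name and lets everything else propagate. The sketch below is an assumption about how you might structure it, not part of the pipeline above: call_with_retry, its parameters, and the backoff schedule are all hypothetical.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

R = TypeVar("R")


def call_with_retry(
    fn: Callable[[], R],
    transient: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Call fn, retrying only on the exception types listed in `transient`.

    Any other exception propagates immediately, so document errors are
    never retried.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except transient:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller record the failure
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

In the batch loop this could wrap the per-file call, for example call_with_retry(lambda: process_invoice(str(pdf_path)), transient=(openai.RateLimitError, openai.APIConnectionError)). The ValueError raised for an empty document is not in that tuple, so it falls straight through to the log-and-skip branch.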
When Does This Break?
Three failure modes come up in practice.
Tables are a pain. pymupdf extracts table content as a flat sequence of values: cell order isn't guaranteed, and merged cells produce garbage. For documents that are mostly tables (financial statements, data exports), call pymupdf's find_tables() on each page and pass the structured table data to the model separately from the prose.
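Passing table data separately means turning rows into text the model can read row by row. The helper below is a sketch of one possible rendering, assuming each table arrives as a list of rows with None for empty cells; rows_to_text and the sample rows are hypothetical.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[str]]


def rows_to_text(rows: Sequence[Row]) -> str:
    """Render table rows as pipe-delimited lines, one line per row.

    Empty cells (None) become empty strings, and whitespace inside a cell,
    including newlines, is collapsed so each row stays on a single line.
    """
    lines: List[str] = []
    for row in rows:
        cells = [" ".join((cell or "").split()) for cell in row]
        lines.append(" | ".join(cells))
    return "\n".join(lines)


# With pymupdf, the rows would come from something like:
#     for table in page.find_tables().tables:
#         rows = table.extract()
# A hand-written example stands in for that output here.
example_rows = [
    ["Item", "Qty", "Unit price"],
    ["Widget\n(large)", "2", "10.00"],
    ["Shipping", None, "5.00"],
]
table_text = rows_to_text(example_rows)

if __name__ == "__main__":
    print(table_text)
```

The rendered text can go into the prompt under its own label, next to the prose from get_text(), so the model sees which values share a row.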
Scanned PDFs return an empty string. If extract_text() returns nothing, the PDF is almost certainly image-based, and you need OCR first. The quality of your OCR output directly limits your extraction quality: no model is smart enough to extract structured data from garbled text.
Long documents hit context limits. The 8000-character trim above is a blunt instrument. For multi-page contracts, a better approach is extracting page-by-page, running a lightweight check to identify which pages contain the relevant fields, then passing only those pages to the extraction call. More API calls, but you're not feeding 40 pages of boilerplate to a model looking for three fields.
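The page-selection idea can be sketched with a plain keyword match as the lightweight check. This assumes you keep the per-page strings (the list built from page.get_text()) instead of joining them; select_pages and the keyword list are hypothetical, and a cheap model call could replace the keyword test.

```python
from typing import List, Sequence

# Hypothetical keyword list; tune it to the fields in your schema.
KEYWORDS = ("total", "amount due", "invoice", "due date")
SEPARATOR = "\n\n"


def select_pages(
    pages: Sequence[str],
    keywords: Sequence[str] = KEYWORDS,
    max_chars: int = 8000,
) -> str:
    """Keep pages that mention any keyword, in order, within a character budget."""
    selected: List[str] = []
    used = 0
    for number, text in enumerate(pages, start=1):
        lowered = text.lower()
        if not any(keyword in lowered for keyword in keywords):
            continue  # no keyword on this page: treat it as boilerplate
        chunk = f"[page {number}]\n{text}"
        cost = len(chunk) + (len(SEPARATOR) if selected else 0)
        if used + cost > max_chars:
            break  # budget exhausted: stop rather than cut a page in half
        selected.append(chunk)
        used += cost
    return SEPARATOR.join(selected)
```

The page markers are there so a reviewer can trace an extracted value back to where it came from.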
Don't Trust the Output Blindly
I've shipped code that skipped validation. One vendor's invoice had a typo in the subtotal field — the model faithfully extracted the wrong number and it sat in the database for two weeks before anyone noticed.
A check that catches the most common issues before they reach your database:
def validate_invoice(data: InvoiceData) -> list[str]:
    issues = []
    if data.total_amount <= 0:
        issues.append(f"Invalid total: {data.total_amount}")
    if data.subtotal is not None and data.tax_amount is not None:
        expected = data.subtotal + data.tax_amount
        if abs(expected - data.total_amount) > 0.02:
            issues.append(
                f"Total mismatch: {data.subtotal} + {data.tax_amount} = {expected}, "
                f"but extracted total is {data.total_amount}"
            )
    return issues
It doesn't catch everything. Zero or negative totals (often OCR artifacts) and math that doesn't add up: those it handles. A wrong number that happens to be internally consistent still gets through. If validate_invoice() returns issues, flag the document for human review rather than writing to the database.
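The flag-for-review step can be as simple as splitting records by whether the validator returned anything. A minimal sketch, with split_for_review as a hypothetical helper that accepts any validator returning a list of issue strings:

```python
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

Rec = TypeVar("Rec")


def split_for_review(
    records: Iterable[Rec],
    validate: Callable[[Rec], List[str]],
) -> Tuple[List[Rec], List[Dict[str, object]]]:
    """Separate records that pass validation from those needing human review."""
    accepted: List[Rec] = []
    flagged: List[Dict[str, object]] = []
    for record in records:
        issues = validate(record)
        if issues:
            # Keep the issues next to the record so the reviewer sees why.
            flagged.append({"record": record, "issues": issues})
        else:
            accepted.append(record)
    return accepted, flagged
```

With the code above, that would be called as accepted, flagged = split_for_review(invoices, validate_invoice), where invoices is a list of InvoiceData instances. Only accepted goes to the database.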
Wrapping Up
The pattern — extract text, define a schema, use structured output — generalizes across document types: contracts, receipts, medical forms, shipping manifests. The schema changes; the pipeline doesn't.
Where this gets genuinely hard is mixed-content documents: a contract with numbered clauses, data tables, and narrative prose all on the same page. I don't have a clean answer for that case. Start with the fields you actually query against — usually fewer than ten — and add more only when a real use case demands it. Trying to extract everything upfront produces schemas that are impossible to validate and barely more useful than the original PDF.