RAG (Retrieval-Augmented Generation) Explained with Real-World Examples

Prompt engineering and fine-tuning solve the same problem: getting an LLM to do exactly what you want. They solve it in completely different ways, at completely different costs, and one of them is almost always the wrong choice for what you're actually trying to do.

What This Covers

The practical decision framework for when to prompt-engineer your way to the output you need vs. when fine-tuning is actually justified. Includes real cost comparisons, working code examples, and the specific signals that tell you which path to take.

Prerequisites

Comfortable calling LLM APIs (OpenAI, Anthropic, or similar)
Basic Python — the code here is straightforward
No ML background required

What Prompt Engineering Actually Is

Prompt engineering is everything you do at inference time to shape the model's output: system prompts that set context and persona, few-shot examples embedded in the prompt, chain-of-thought instructions, output format constraints, role framing, and tone instructions. The model's weights never change. You're working with the model as-is, steering it through the context window.

Here's a concrete example — extracting structured data from a customer support message:

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

SYSTEM_PROMPT = """You extract structured data from customer support messages.
Always respond with valid JSON matching this schema:
{
  "issue_type": "billing | technical | general",
  "urgency": "low | medium | high",
  "summary": "one sentence"
}

Never include explanation outside the JSON object."""

EXAMPLE_MESSAGE = "My payment failed twice and now I can't log in at all. I have a meeting in two hours where I need to demo this."

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    system=SYSTEM_PROMPT,
    messages=[
        {"role": "user", "content": EXAMPLE_MESSAGE}
    ]
)

# The reply is a list of content blocks; the first one holds the JSON text.
print(response.content[0].text)

That system prompt — plus the format constraint — is doing a lot of work. In most production cases, that's enough.
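
Even with a strict format constraint, it is worth checking the reply before anything downstream consumes it. The following is a minimal sketch that assumes the schema from the system prompt above; `parse_ticket` and the two allowed-value tuples are illustrative names, not part of the Anthropic SDK.

```python
import json

# Allowed values, copied from the schema in the system prompt above.
ISSUE_TYPES = ("billing", "technical", "general")
URGENCY_LEVELS = ("low", "medium", "high")


def parse_ticket(raw: str) -> dict:
    """Parse the model's reply and check it against the expected schema.

    Raises ValueError if the reply is not valid JSON or breaks the schema.
    (parse_ticket is a hypothetical helper, not an SDK function.)
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if data.get("issue_type") not in ISSUE_TYPES:
        raise ValueError(f"unexpected issue_type: {data.get('issue_type')!r}")
    if data.get("urgency") not in URGENCY_LEVELS:
        raise ValueError(f"unexpected urgency: {data.get('urgency')!r}")
    summary = data.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("summary must be a non-empty string")
    return data
```

With the example above, `parse_ticket(response.content[0].text)` would either return a dict or raise, which gives you a clear place to retry or log when the model drifts from the format.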

What Fine-Tuning Actually Is

Fine-tuning updates the model's weights using your training data. You start with a pre-trained base model and run additional gradient descent steps on your examples. The result is a new model checkpoint that has internalized patterns from your dataset.

OpenAI's fine-tuning API is the most accessible entry point. A minimal training file (JSONL format) looks like this:

{"messages": [{"role": "system", "content": "You are a support agent for Acme Corp."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Head to Settings > Security > Reset Password. You'll get an email within 2 minutes."}]}
{"messages": [{"role": "system", "content": "You are a support agent for Acme Corp."}, {"role": "user", "content": "Where do I find my invoices?"}, {"role": "assistant", "content": "Invoices live under Billing > Invoice History. You can download PDFs directly from there."}]}
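
A malformed line can fail a job after you have already waited on the upload, so a quick local check is cheap insurance. This is a rough sketch that only checks the basic chat shape shown above (a `messages` list of role/content pairs ending with the assistant reply); `validate_training_lines` is a hypothetical helper and is not a substitute for the provider's own validation.

```python
import json

VALID_ROLES = ("system", "user", "assistant")


def validate_training_lines(lines):
    """Return a list of (line_number, problem) tuples; empty means nothing was flagged."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        malformed = any(
            not isinstance(m, dict)
            or m.get("role") not in VALID_ROLES
            or not isinstance(m.get("content"), str)
            for m in messages
        )
        if malformed:
            problems.append((number, "message with bad role or non-string content"))
            continue
        if messages[-1]["role"] != "assistant":
            problems.append((number, "last message should be the assistant reply"))
    return problems
```

Because the function accepts any iterable of strings, you can pass an open file handle for `training_data.jsonl` directly and print whatever it returns.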

Kicking off a training job:

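from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Upload the JSONL file so the fine-tuning job can reference it by ID.
with open("training_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Start the job against a pinned model snapshot.
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18"
)

print(f"Job ID: {job.id}")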

Training usually takes 30–60 minutes for small datasets. Once done, you get a model ID like ft:gpt-4o-mini-2024-07-18:your-org:custom-name:abc123 and call it exactly like the base model.
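
One way to wait for that model ID is to poll the job until it finishes. The sketch below assumes `client` is an `openai.OpenAI` instance and relies on the job's `status` and `fine_tuned_model` fields; the function name, the 60-second default, and the injectable `sleep` argument are illustrative choices rather than anything the SDK prescribes.

```python
import time


def wait_for_fine_tune(client, job_id, poll_seconds=60, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal state.

    Returns the fine-tuned model ID on success and raises RuntimeError
    if the job fails or is cancelled. `sleep` is injectable so the loop
    can be exercised without real waiting.
    """
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"fine-tuning job {job_id} ended as {job.status}")
        # Still validating, queued, or running: wait and check again.
        sleep(poll_seconds)
```

Calling `wait_for_fine_tune(client, job.id)` after creating the job would block until the model is ready. For anything longer-lived than a script, a scheduled check is probably a better fit than a blocking loop.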

The catch: you need data. OpenAI's API accepts as few as 10 examples, and its guidance is that clear improvements typically show up at around 50–100. In practice, expect to need 500–1,000+ to reliably beat a well-crafted prompt. Reaching 10,000 high-quality labeled examples, where fine-tuning really starts pulling ahead, is a non-trivial data collection problem.

The Real Cost Comparison

Fine-tuning looks cheap on paper because a fine-tuned model needs less prompt context, so each call can cost less. The full cost includes more:

Cost Type	Prompt Engineering	Fine-Tuning
Upfront	$0	Training compute ($2–$50+ per run)
Data collection	$0	High (time + annotation)
Iteration speed	Minutes	Hours per run
Inference cost	Higher (longer prompts)	Lower (less context needed)
Break-even volume	N/A	~100k+ calls/month

If you're calling the API fewer than 100,000 times per month, prompt engineering is almost certainly cheaper end-to-end. The inference cost delta only becomes significant at scale.
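
To see where your own break-even sits, a back-of-the-envelope calculation is usually enough. The sketch below is only illustrative: every price, token count, and the $100 monthly overhead are made-up placeholders (including the assumption that the fine-tuned model is priced higher per token), so substitute your provider's real numbers.

```python
def monthly_cost(calls, input_tokens, output_tokens, input_price, output_price):
    """Monthly inference cost in dollars; prices are per 1M tokens."""
    per_call = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return calls * per_call


def break_even_calls(prompted_per_call, tuned_per_call, monthly_overhead):
    """Calls per month at which fine-tuning's fixed overhead is paid back.

    Returns None when the fine-tuned path is not cheaper per call.
    """
    saving = prompted_per_call - tuned_per_call
    if saving <= 0:
        return None
    return monthly_overhead / saving


# Placeholder numbers for illustration only.
# Prompted model: long system prompt plus few-shot examples on every call.
prompted = monthly_cost(1, input_tokens=1500, output_tokens=100,
                        input_price=1.00, output_price=4.00)
# Fine-tuned model: short prompt, but assumed to cost more per token.
tuned = monthly_cost(1, input_tokens=300, output_tokens=100,
                     input_price=1.50, output_price=6.00)
# Overhead: amortized training runs, data upkeep, and regression testing.
print(break_even_calls(prompted, tuned, monthly_overhead=100.0))
```

With these placeholder inputs the break-even comes out at roughly 118,000 calls per month. The useful part is seeing how quickly it moves when the prompt-length gap or the overhead changes.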

There's also a maintenance cost: a fine-tuned model is pinned to a specific base model version. When the provider deprecates that version, you retrain. Prompt engineering adapts the moment you update the system prompt.

When Prompt Engineering Is the Right Call

If you're still figuring out what you want the model to do, don't reach for fine-tuning. Prompt changes take seconds; retraining takes hours.

Most "the model doesn't do what I want" problems are solved by being more explicit. If the model keeps responding in paragraph form when you want JSON, put RESPOND ONLY WITH VALID JSON. NO EXPLANATION. in all-caps in the system prompt. Works more often than it should.

Fine-tuning also struggles with varied inputs. It excels when the task distribution is narrow and consistent. If your use case covers diverse questions, contexts, or formats, prompt engineering handles that cleanly without extra work.

The operational overhead of fine-tuning is real: data pipelines, training jobs, model versioning, regression testing. A two-person team shipping fast usually can't absorb that without slowing down.

When Fine-Tuning Is Actually Justified

Fine-tuning earns its cost in specific situations.

When tone and style consistency matters more than anything else — a brand voice that can't be adequately described in a system prompt because it's a feel, not a rule set — showing the model hundreds of examples works better than describing the style in words. You know it when you try to write the system prompt and realize you can't.

When you have proprietary domain vocabulary: medical records, legal documents, internal codebases, niche technical domains. The base model doesn't know your company's product names or field-specific abbreviations. Fine-tuning internalizes them; a system prompt that defines every term gets unwieldy fast.

When you're running the same narrow task at high volume. A fine-tuned smaller model (like GPT-4o mini) can match a larger prompted model's quality at lower inference cost. The break-even depends on your volume and how much prompt context you're currently sending, but it usually lands around 100,000 calls per month.

When latency is a hard constraint. Fine-tuned models on smaller bases respond faster than larger models with lengthy system prompts. In real-time applications, that gap matters.

The Decision Framework

Before touching a training dataset, work through this in order:

1. Does a better system prompt fix the problem? Spend two hours iterating before anything else.
2. Do few-shot examples close the gap? Add 3–5 worked examples directly in the prompt.
3. Is this a knowledge problem or a behavior problem? Knowledge gaps usually call for RAG, not fine-tuning. Behavior issues (tone, format, style) are what fine-tuning actually addresses.
4. Do you have 500+ high-quality labeled examples? Below that threshold, fine-tuning won't reliably outperform a good prompt.
5. Are you running this at a volume where the inference cost delta matters? Run the numbers.

If you get through all five and the answer still points to fine-tuning, you have a genuine use case. Most problems don't make it past step two.
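
The few-shot step in the checklist above can be sketched as a small helper that turns worked examples into alternating user and assistant turns. `build_few_shot_messages` is a hypothetical name, and the example pairs you feed it are your own to supply.

```python
def build_few_shot_messages(examples, new_input):
    """Turn (input, ideal_output) pairs into alternating user/assistant turns,
    followed by the new input the model should answer."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": new_input})
    return messages
```

The returned list can be passed as the `messages` argument in the extraction example from earlier, with the system prompt left unchanged. Keeping the examples in code rather than pasted into the system prompt makes it easier to swap them in and out while you iterate.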

Wrapping Up

The pattern I see most: someone spends a week building a fine-tuning pipeline, trains on 200 examples, and ends up with results that a 15-minute prompt engineering session would have matched. Fine-tuning is a real lever — it just works best after you've maxed out what prompt engineering can do and have the data volume to back it up.

Start with the prompt. Add examples. If you're still hitting a ceiling after that, check whether your problem fits the fine-tuning profile: consistent narrow task, high volume, proprietary style or vocabulary, hard latency constraints.

One more thing worth knowing: if the issue is domain vocabulary or knowledge gaps, embeddings-based retrieval (RAG) often beats fine-tuning at a fraction of the cost — and keeps your knowledge base updatable without rerunning training jobs.
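
To make the retrieval idea concrete, here is a toy sketch of the lookup step. It uses a bag-of-words count vector as a stand-in for a real embedding model, so `embed`, `cosine`, and `retrieve` are illustrative only; in practice you would call an embeddings API and a vector store instead.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]
```

The retrieved passages are then pasted into the prompt as context, which is why updating the knowledge base only means editing documents, not retraining.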