Build a RAG App in Python Without LangChain
Build a retrieval-augmented generation app from scratch with Python, Google Gemini, and cosine similarity. No frameworks, no vector databases — just code you understand.
Most RAG tutorials start with pip install langchain and end with code you cannot debug. You have a working chatbot, but you don't understand how it retrieves context, why it sometimes hallucinates, or how to improve it.
This tutorial takes a different approach. You build a RAG app from scratch — no LangChain, no LlamaIndex, no vector database. Just Python, Google Gemini, and about 100 lines of code.
By the end, you will have a command-line app that loads any PDF and answers questions about it.
What you will build
A Python script that does four things:
- Extracts text from a PDF
- Splits it into overlapping chunks
- Converts each chunk into a vector embedding
- Finds the most relevant chunks for a question and generates an answer
You run it like this:
```shell
python app.py invoice.pdf "What is the total amount due?"
```

And get back an answer grounded in the actual PDF content — not hallucinated.
What is RAG and why build it from scratch?
RAG stands for Retrieval-Augmented Generation. It solves a fundamental problem with large language models: they make things up.
An LLM has no access to your documents. Ask it about your company's refund policy or last quarter's revenue, and it will either refuse to answer or confidently generate something wrong.
RAG fixes this by adding a retrieval step before generation:
- Retrieve — find the most relevant passages from your documents
- Augment — inject those passages into the prompt as context
- Generate — the model answers using only the provided context
When you build this with a framework, the retrieval step is a black box. When you build it from scratch, you understand exactly how your app finds information. That understanding is what lets you debug problems and improve accuracy later.
Do you actually need RAG?
Before you build anything, ask yourself: does this problem require retrieval at all?
Gemini 2.5 Pro accepts up to 1 million tokens in a single prompt. That is roughly 3,000 pages of text. If your document fits in the context window, you can skip the entire retrieval pipeline and paste the full text directly into the prompt.
When to skip RAG and use the full context window:
- Your document is under 100 pages
- You need the model to reason across the entire document, not just a few passages
- You ask broad questions like "summarize this report" that require full context
When RAG is the right choice:
- Your corpus is too large for the context window (multiple documents, databases, knowledge bases)
- You need fast, cheap answers — embedding search is faster and cheaper than sending 500 pages per query
- You want the model to cite specific passages, not synthesize from everything
- You need to update your knowledge base without re-processing the entire corpus
For this tutorial, RAG is the right tool. You are building a system that can handle any PDF, cache its embeddings, and answer multiple questions without re-processing. That pattern scales to thousands of documents. Pasting the full text into a prompt does not.
Prerequisites
Before you start, make sure you have:
- Python 3.9 or later
- A Google AI Studio account (free tier works)
- A PDF file to test with (any document will work)
Set up the project
Create a project folder and virtual environment:
```shell
mkdir pdf-rag && cd pdf-rag
python -m venv venv
source venv/bin/activate
```

Install four dependencies:

```shell
pip install pypdf google-genai numpy python-dotenv
```

Here is what each package does:
- pypdf — reads PDF files and extracts text
- google-genai — calls the Gemini API for embeddings and text generation
- numpy — handles vector math for cosine similarity
- python-dotenv — loads your API key from a .env file
Create a .env file with your Gemini API key:
```shell
echo "GEMINI_API_KEY=your_key_here" > .env
echo ".env" >> .gitignore
```

Replace your_key_here with an actual API key from Google AI Studio.
Create a file called app.py. All the code goes in this single file.
Step 1: Extract text from the PDF
The first function reads a PDF and returns all its text as a single string.
```python
import pypdf


def extract_text(pdf_path):
    """Extract all text from a PDF and return it as a single string."""
    reader = pypdf.PdfReader(pdf_path)
    pages = [page.extract_text() for page in reader.pages]
    return "\n".join(pages)
```

PdfReader opens the file. The list comprehension calls extract_text() on each page. Then you join all pages with newlines.
Step 2: Split text into chunks
You cannot send an entire PDF to the embedding model. Token limits exist, and even within those limits, smaller chunks produce better search results.
The strategy: split the text into 500-character chunks with a 100-character overlap. The overlap ensures that sentences split across chunk boundaries still appear in at least one chunk.
```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of fixed size."""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks
```

The step size is chunk_size - overlap (400 characters). Each chunk starts 400 characters after the previous one but grabs 500 characters — creating a 100-character overlap with the next chunk.
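A quick way to see the overlap is to run the function on dummy text. This standalone snippet repeats the chunk_text definition so it runs on its own:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of fixed size."""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks


# 1,000 characters of dummy text: with a step size of 400,
# chunks start at positions 0, 400, and 800.
text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(text)

print(len(chunks))                          # 3
print(len(chunks[0]), len(chunks[-1]))      # 500 200 (the tail is shorter)
print(chunks[0][-100:] == chunks[1][:100])  # True: 100 shared characters
```

The last chunk is shorter than chunk_size whenever the text length is not a multiple of the step size, which is harmless for retrieval.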
Step 3: Generate embeddings
An embedding is a list of numbers that captures the meaning of a piece of text. Similar texts produce similar numbers. This is what makes search possible — you find relevant chunks by comparing their embeddings to the question's embedding.
First, set up the Gemini client:
```python
import os
from dotenv import load_dotenv
from google import genai
from google.genai import types


def create_client():
    """Load the API key from the environment and return a Gemini client."""
    load_dotenv()
    api_key = os.getenv("GEMINI_API_KEY")
    client = genai.Client(api_key=api_key)
    return client
```

Now write a function that embeds a single piece of text:
```python
def embed_text(client, text):
    """Embed a single text string and return its vector."""
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
    )
    return result.embeddings[0].values
```

The task_type="RETRIEVAL_DOCUMENT" setting tells the model this text is a document being indexed, not a search query. This distinction matters — the model optimizes the embedding differently for documents versus queries.
To embed all chunks, process them in batches. The Gemini free tier has rate limits, so you add a pause between batches:
```python
import time


def embed_all_chunks(client, chunks):
    """Embed every chunk in batches to respect the free tier rate limit."""
    BATCH_SIZE = 90
    embeddings = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        for chunk in batch:
            embeddings.append(embed_text(client, chunk))
        if i + BATCH_SIZE < len(chunks):
            print("Rate limit pause — waiting 60 seconds...")
            time.sleep(60)
    return embeddings
```

Step 4: Search with cosine similarity
Now you have a list of chunks and a matching list of embeddings. To answer a question, you need to find which chunks are most relevant.
Cosine similarity measures how similar two vectors are. It returns a score from -1 to 1, where 1 means identical direction (same meaning) and 0 means unrelated.
The formula divides the dot product of two vectors by the product of their lengths (norms):
```python
import numpy as np


def cosine_similarity(vec_a, vec_b):
    """Return the cosine similarity between two vectors."""
    dot = np.dot(vec_a, vec_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return dot / norm
```

The search function embeds the question, scores every chunk against it, sorts by score, and returns the top results:
```python
def search(client, query, chunks, embeddings, top_k=3):
    """Return the top_k most relevant chunks for the given query."""
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=query,
        config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"),
    )
    query_vector = result.embeddings[0].values
    scores = [
        (cosine_similarity(query_vector, emb), chunk)
        for emb, chunk in zip(embeddings, chunks)
    ]
    scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in scores[:top_k]]
```

Notice the task_type changed to "RETRIEVAL_QUERY". The model generates a different embedding for search queries than for documents. This asymmetry improves retrieval accuracy.
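To build intuition for the scores the search sorts by, here is cosine_similarity applied to toy 2-D vectors. The numbers are made up for illustration; real embeddings have hundreds of dimensions:

```python
import numpy as np


def cosine_similarity(vec_a, vec_b):
    """Return the cosine similarity between two vectors."""
    dot = np.dot(vec_a, vec_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return dot / norm


a = np.array([1.0, 0.0])
print(cosine_similarity(a, np.array([2.0, 0.0])))   # 1.0, same direction
print(cosine_similarity(a, np.array([0.0, 3.0])))   # 0.0, orthogonal
print(cosine_similarity(a, np.array([-1.0, 0.0])))  # -1.0, opposite
```

Note that magnitude cancels out of the formula: [2.0, 0.0] scores the same as [1.0, 0.0]. Only direction, which encodes meaning, matters.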
When retrieval fails
Vector search works well for meaning-based questions like "What is the refund policy?" But it has a blind spot: exact matches.
Suppose your PDF contains an invoice with the line "Invoice #4521 — Total: $2,340.00" and you ask:
"What is the total for invoice #4521?"
Cosine similarity compares meaning, not keywords. The embedding for your question is close to any chunk that discusses invoices and totals — not necessarily the chunk that contains #4521. If multiple invoices exist in the document, the search might return the wrong one.
This happens because embeddings compress text into a fixed-size vector. Specific identifiers like invoice numbers, dates, and product codes lose their distinctiveness.
You will hit the same problem with:
- Names and IDs — "What did John Smith order?" might match any chunk about orders
- Exact figures — "Which quarter had $4.2M revenue?" might match any chunk about revenue
- Code references — "What does function parse_config do?" might match any chunk about parsing
The fix is hybrid search: combine vector similarity with keyword matching (BM25). The keyword search catches exact terms that embeddings miss. Production RAG systems use both and merge the results. This tutorial keeps things simple with vector-only search, but understanding this limitation helps you debug retrieval problems later.
Step 5: Build the prompt and generate an answer
You now have the relevant chunks. The final step is to build a prompt that includes these chunks as context and ask Gemini to answer the question.
```python
def build_prompt(question, context_chunks):
    """Assemble a RAG prompt from the question and retrieved chunks."""
    context = "\n\n".join(context_chunks)
    prompt = (
        "You are a helpful assistant. Answer the question using only the "
        "context below.\n"
        'If the answer is not in the context, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )
    return prompt
```

The instruction "Answer the question using only the context below" is what grounds the model. Without it, the model might ignore your chunks and generate an answer from its training data.
The "I don't know" instruction prevents hallucination. If the relevant information is not in the retrieved chunks, the model should admit it rather than guess.
Now generate the answer:
```python
def generate_answer(client, prompt):
    """Send the prompt to Gemini and return the generated answer."""
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
    )
    return response.text
```

Step 6: Add caching
Embedding a large PDF takes time and API calls. You do not want to re-embed the same document every time you ask a question. A simple JSON cache solves this:
```python
import json


def save_embeddings(chunks, embeddings, cache_path):
    """Save chunks and embeddings to a JSON file."""
    data = {"chunks": chunks, "embeddings": embeddings}
    with open(cache_path, "w") as f:
        json.dump(data, f)


def load_embeddings(cache_path):
    """Load chunks and embeddings from a JSON file."""
    if not os.path.exists(cache_path):
        return None
    with open(cache_path) as f:
        data = json.load(f)
    return data["chunks"], data["embeddings"]
```

Step 7: Wire it all together
The main() function ties every piece together:
```python
import sys


def main():
    pdf_path = sys.argv[1]
    question = sys.argv[2]
    cache_path = pdf_path + ".cache.json"

    client = create_client()
    print(f"Loading {pdf_path}...")

    cached = load_embeddings(cache_path)
    if cached:
        chunks, embeddings = cached
        print(f"Loaded cache from {cache_path}")
    else:
        text = extract_text(pdf_path)
        chunks = chunk_text(text)
        print(f"No cache found. Embedding {len(chunks)} chunks...")
        embeddings = embed_all_chunks(client, chunks)
        save_embeddings(chunks, embeddings, cache_path)
        print(f"Cache saved to {cache_path}")

    top_chunks = search(client, question, chunks, embeddings)
    prompt = build_prompt(question, top_chunks)
    answer = generate_answer(client, prompt)

    print("\nAnswer:")
    print(answer)


if __name__ == "__main__":
    main()
```

The flow:
- Read the PDF path and question from command-line arguments
- Check for a cached embedding file
- If no cache exists, extract text, chunk it, embed it, and save the cache
- Search for the most relevant chunks
- Build a prompt with those chunks as context
- Generate and print the answer
Run it
Grab any PDF and try it:
```shell
python app.py report.pdf "What were the key findings?"
```

First run takes longer because it embeds every chunk. Subsequent questions about the same PDF use the cache and answer in seconds.
The full code
Here is the complete app.py — about 100 lines of actual logic:
import os
import sys
import json
import time
import pypdf
import numpy as np
from dotenv import load_dotenv
from google import genai
from google.genai import types
def extract_text(pdf_path):
"""Extract all text from a PDF and return it as a single string."""
reader = pypdf.PdfReader(pdf_path)
pages = [page.extract_text() for page in reader.pages]
return "\n".join(pages)
def chunk_text(text, chunk_size=500, overlap=100):
"""Split text into overlapping chunks of fixed size."""
chunks = []
for i in range(0, len(text), chunk_size - overlap):
chunks.append(text[i : i + chunk_size])
return chunks
def create_client():
"""Load the API key from the environment and return a Gemini client."""
load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=api_key)
return client
def embed_text(client, text):
"""Embed a single text string and return its vector."""
result = client.models.embed_content(
model="gemini-embedding-001",
contents=text,
config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
)
return result.embeddings[0].values
def embed_all_chunks(client, chunks):
"""Embed every chunk in batches to respect the free tier rate limit."""
BATCH_SIZE = 90
embeddings = []
for i in range(0, len(chunks), BATCH_SIZE):
batch = chunks[i : i + BATCH_SIZE]
for chunk in batch:
embeddings.append(embed_text(client, chunk))
if i + BATCH_SIZE < len(chunks):
print("Rate limit pause — waiting 60 seconds...")
time.sleep(60)
return embeddings
def cosine_similarity(vec_a, vec_b):
"""Return the cosine similarity between two vectors."""
dot = np.dot(vec_a, vec_b)
norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
return dot / norm
def search(client, query, chunks, embeddings, top_k=3):
"""Return the top_k most relevant chunks for the given query."""
result = client.models.embed_content(
model="gemini-embedding-001",
contents=query,
config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"),
)
query_vector = result.embeddings[0].values
scores = [
(cosine_similarity(query_vector, emb), chunk)
for emb, chunk in zip(embeddings, chunks)
]
scores.sort(key=lambda x: x[0], reverse=True)
return [chunk for _, chunk in scores[:top_k]]
def build_prompt(question, context_chunks):
"""Assemble a RAG prompt from the question and retrieved chunks."""
context = "\n\n".join(context_chunks)
prompt = (
"You are a helpful assistant. Answer the question using only the "
"context below.\n"
'If the answer is not in the context, say "I don\'t know."\n\n'
f"Context:\n{context}\n\n"
f"Question:\n{question}"
)
return prompt
def generate_answer(client, prompt):
"""Send the prompt to Gemini and return the generated answer."""
response = client.models.generate_content(
model="gemini-2.5-flash",
contents=prompt,
)
return response.text
def save_embeddings(chunks, embeddings, cache_path):
"""Save chunks and embeddings to a JSON file."""
data = {"chunks": chunks, "embeddings": embeddings}
with open(cache_path, "w") as f:
json.dump(data, f)
def load_embeddings(cache_path):
"""Load chunks and embeddings from a JSON file."""
if not os.path.exists(cache_path):
return None
with open(cache_path) as f:
data = json.load(f)
return data["chunks"], data["embeddings"]
def main():
pdf_path = sys.argv[1]
question = sys.argv[2]
cache_path = pdf_path + ".cache.json"
client = create_client()
print(f"Loading {pdf_path}...")
cached = load_embeddings(cache_path)
if cached:
chunks, embeddings = cached
print(f"Loaded cache from {cache_path}")
else:
text = extract_text(pdf_path)
chunks = chunk_text(text)
print(f"No cache found. Embedding {len(chunks)} chunks...")
embeddings = embed_all_chunks(client, chunks)
save_embeddings(chunks, embeddings, cache_path)
print(f"Cache saved to {cache_path}")
top_chunks = search(client, question, chunks, embeddings)
prompt = build_prompt(question, top_chunks)
answer = generate_answer(client, prompt)
print("\nAnswer:")
print(answer)
if __name__ == "__main__":
main()Beyond vector search
The app you just built uses vector-only retrieval. That works for most questions, but as the "When retrieval fails" section showed, it struggles with exact keyword matches.
Production RAG systems solve this with hybrid search — running two searches in parallel and merging the results:
- Vector search finds chunks that are semantically similar to the question. It handles paraphrasing, synonyms, and abstract queries well.
- Keyword search (BM25) finds chunks that contain the exact words from the question. It handles names, IDs, codes, and specific figures well.
A reranker then scores the combined results. Cross-encoder models like ms-marco-MiniLM read the full question and each chunk together, producing a more accurate relevance score than either search method alone. Recent benchmarks show hybrid search with reranking improves accuracy by 30-40% over vector-only retrieval.
You do not need a vector database to add hybrid search. Python's rank_bm25 library handles keyword scoring in about 10 lines of code. Combined with the cosine similarity search you already built, you get a production-grade retrieval pipeline without any framework.
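To make the merge step concrete without adding a dependency, here is a minimal sketch. A toy keyword scorer stands in for BM25 (rank_bm25's BM25Okapi would do this properly), and a weighted sum combines it with hypothetical cosine scores. The chunks and vector scores are invented for illustration; in practice you would plug in your real search results:

```python
def keyword_score(query, chunk):
    """Toy keyword scorer: fraction of query terms present in the chunk.
    A crude stand-in for BM25, which also weights term rarity and length."""
    query_terms = query.lower().split()
    chunk_terms = set(chunk.lower().split())
    hits = sum(1 for term in query_terms if term in chunk_terms)
    return hits / len(query_terms)


def hybrid_rank(query, chunks, vector_scores, alpha=0.5):
    """Merge vector and keyword scores with a weighted sum, best first."""
    combined = [
        (alpha * vec + (1 - alpha) * keyword_score(query, chunk), chunk)
        for vec, chunk in zip(vector_scores, chunks)
    ]
    combined.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in combined]


chunks = [
    "Invoice #4521 — Total: $2,340.00",
    "Invoices are sent monthly and totals include tax.",
]
# Hypothetical cosine scores: the generic chunk happens to score higher,
# the exact-match failure mode described earlier.
vector_scores = [0.70, 0.80]

# The keyword term "#4521" pulls the right chunk back to the top.
print(hybrid_rank("total for invoice #4521", chunks, vector_scores)[0])
```

The weight alpha controls the balance: 1.0 is pure vector search, 0.0 is pure keyword search. Production systems often use reciprocal rank fusion instead of a weighted sum, since it avoids comparing scores on different scales.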
Here are two more improvements worth exploring:
- Smarter chunking — split on paragraph boundaries instead of fixed character counts. This avoids cutting sentences in half.
- Multiple PDFs — load several documents into the same embedding cache and search across all of them.
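The first idea can be sketched in a few lines. This hypothetical chunk_by_paragraphs greedily packs whole paragraphs into chunks up to a size limit; a real splitter would also break up paragraphs that exceed the limit, falling back to sentence boundaries:

```python
def chunk_by_paragraphs(text, max_chars=500):
    """Greedily pack whole paragraphs into chunks of up to max_chars.
    A paragraph longer than max_chars becomes its own oversized chunk;
    a real splitter would fall back to sentence-level splitting there."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph about refunds.\n\n"
    "Second paragraph about shipping.\n\n" + "A" * 600
)
chunks = chunk_by_paragraphs(text)
print(len(chunks))  # 2: the two short paragraphs packed, the long one alone
```

Because no paragraph is ever cut mid-sentence, each chunk reads as coherent text, which tends to produce cleaner embeddings than fixed-size slicing.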
Go deeper with the interactive course
This tutorial gives you the code. The Build a RAG App — Chat with Your PDFs interactive course gives you the full learning experience: step-by-step instructions, an in-browser code editor, and instant validation as you build each function.
You write the code yourself, one function at a time. The course checks your work after every step. Five lessons, 26 chapters, about 3 hours from start to finish.

CTO & Co-founder of DevGuild. Making coding education feel like a game, not a lecture. Shipping features at an unreasonable pace.
@RealDevGuild