Build a RAG App in Python Without LangChain
Build a retrieval-augmented generation app from scratch with Python, Google Gemini, and cosine similarity. No frameworks, no vector databases — just code you understand.
Most RAG tutorials start with pip install langchain and end with code you cannot debug. You have a working chatbot, but you don't understand how it retrieves context, why it sometimes hallucinates, or how to improve it.
This tutorial takes a different approach. You build a RAG app from scratch — no LangChain, no LlamaIndex, no vector database. Just Python, Google Gemini, and about 100 lines of code.
By the end, you will have a command-line app that loads any PDF and answers questions about it.
What you will build
A Python script that does four things:
- Extracts text from a PDF
- Splits it into overlapping chunks
- Converts each chunk into a vector embedding
- Finds the most relevant chunks for a question and generates an answer
You run it like this:
```shell
python app.py invoice.pdf "What is the total amount due?"
```

And get back an answer grounded in the actual PDF content — not hallucinated.
What is RAG and why build it from scratch?
RAG stands for Retrieval-Augmented Generation. It solves a fundamental problem with large language models: they make things up.
An LLM has no access to your documents. Ask it about your company's refund policy or last quarter's revenue, and it will either refuse to answer or confidently generate something wrong.
RAG fixes this by adding a retrieval step before generation:
- Retrieve — find the most relevant passages from your documents
- Augment — inject those passages into the prompt as context
- Generate — the model answers using only the provided context
When you build this with a framework, the retrieval step is a black box. When you build it from scratch, you understand exactly how your app finds information. That understanding is what lets you debug problems and improve accuracy later.
Do you actually need RAG?
Before you build anything, ask yourself: does this problem require retrieval at all?
Gemini 2.5 Pro accepts up to 1 million tokens in a single prompt. That is roughly 3,000 pages of text. If your document fits in the context window, you can skip the entire retrieval pipeline and paste the full text directly into the prompt.
When to skip RAG and use the full context window:
- Your document is under 100 pages
- You need the model to reason across the entire document, not just a few passages
- You ask broad questions like "summarize this report" that require full context
When RAG is the right choice:
- Your corpus is too large for the context window (multiple documents, databases, knowledge bases)
- You need fast, cheap answers — embedding search is faster and cheaper than sending 500 pages per query
- You want the model to cite specific passages, not synthesize from everything
- You need to update your knowledge base without re-processing the entire corpus
For this tutorial, RAG is the right tool. You are building a system that can handle any PDF, cache its embeddings, and answer multiple questions without re-processing. That pattern scales to thousands of documents. Pasting the full text into a prompt does not.
Prerequisites
Before you start, make sure you have:
- Python 3.9 or later
- A Google AI Studio account (free tier works)
- A PDF file to test with (any document will work)
Set up the project
Create a project folder and virtual environment:
```shell
mkdir pdf-rag && cd pdf-rag
python -m venv venv
source venv/bin/activate
```

Install four dependencies:

```shell
pip install pypdf google-genai numpy python-dotenv
```

Here is what each package does:
- pypdf — reads PDF files and extracts text
- google-genai — calls the Gemini API for embeddings and text generation
- numpy — handles vector math for cosine similarity
- python-dotenv — loads your API key from a .env file
Create a .env file with your Gemini API key:
```shell
echo "GEMINI_API_KEY=your_key_here" > .env
echo ".env" >> .gitignore
```

Replace your_key_here with an actual API key from Google AI Studio.
Create a file called app.py. All the code goes in this single file.
Step 1: Extract text from the PDF
The first function reads a PDF and returns all its text as a single string.
```python
import pypdf


def extract_text(pdf_path):
    """Extract all text from a PDF and return it as a single string."""
    reader = pypdf.PdfReader(pdf_path)
    pages = [page.extract_text() for page in reader.pages]
    return "\n".join(pages)
```

PdfReader opens the file. The list comprehension calls extract_text() on each page. Then you join all pages with newlines.
Step 2: Split text into chunks
You cannot send an entire PDF to the embedding model. Token limits exist, and even within those limits, smaller chunks produce better search results.
The strategy: split the text into 500-character chunks with a 100-character overlap. The overlap ensures that sentences split across chunk boundaries still appear in at least one chunk.
```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of fixed size."""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks
```

The step size is chunk_size - overlap (400 characters). Each chunk starts 400 characters after the previous one but grabs 500 characters — creating a 100-character overlap with the next chunk.
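A quick way to see the overlap is to run the function on dummy text. This standalone snippet repeats the chunk_text definition so it runs on its own:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of fixed size."""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks


# 1,000 characters of dummy text: with a step size of 400,
# chunks start at positions 0, 400, and 800.
text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(text)

print(len(chunks))                          # 3
print(len(chunks[0]), len(chunks[-1]))      # 500 200 (the tail is shorter)
print(chunks[0][-100:] == chunks[1][:100])  # True: 100 shared characters
```

The last chunk is shorter than chunk_size whenever the text length is not a multiple of the step size, which is harmless for retrieval.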
Step 3: Generate embeddings
An embedding is a list of numbers that captures the meaning of a piece of text. Similar texts produce similar numbers. This is what makes search possible — you find relevant chunks by comparing their embeddings to the question's embedding.
First, set up the Gemini client:
```python
import os
from dotenv import load_dotenv
from google import genai
from google.genai import types


def create_client():
    """Load the API key from the environment and return a Gemini client."""
    load_dotenv()
    api_key = os.getenv("GEMINI_API_KEY")
    client = genai.Client(api_key=api_key)
    return client
```

Now write a function that embeds a single piece of text:
```python
def embed_text(client, text):
    """Embed a single text string and return its vector."""
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
    )
    return result.embeddings[0].values
```

The task_type="RETRIEVAL_DOCUMENT" setting tells the model this text is a document being indexed, not a search query. This distinction matters — the model optimizes the embedding differently for documents versus queries.
To embed all chunks, process them in batches. The Gemini free tier has rate limits, so you add a pause between batches:
```python
import time


def embed_all_chunks(client, chunks):
    """Embed every chunk in batches to respect the free tier rate limit."""
    BATCH_SIZE = 90
    embeddings = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        for chunk in batch:
            embeddings.append(embed_text(client, chunk))
        if i + BATCH_SIZE < len(chunks):
            print("Rate limit pause — waiting 60 seconds...")
            time.sleep(60)
    return embeddings
```

Step 4: Search with cosine similarity
Now you have a list of chunks and a matching list of embeddings. To answer a question, you need to find which chunks are most relevant.
Cosine similarity measures how similar two vectors are. It returns a score from -1 to 1, where 1 means identical direction (same meaning) and 0 means unrelated.
The formula divides the dot product of two vectors by the product of their lengths (norms):
```python
import numpy as np


def cosine_similarity(vec_a, vec_b):
    """Return the cosine similarity between two vectors."""
    dot = np.dot(vec_a, vec_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return dot / norm
```

The search function embeds the question, scores every chunk against it, sorts by score, and returns the top results:
```python
def search(client, query, chunks, embeddings, top_k=3):
    """Return the top_k most relevant chunks for the given query."""
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=query,
        config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"),
    )
    query_vector = result.embeddings[0].values
    scores = [
        (cosine_similarity(query_vector, emb), chunk)
        for emb, chunk in zip(embeddings, chunks)
    ]
    scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in scores[:top_k]]
```

Notice the task_type changed to "RETRIEVAL_QUERY". The model generates a different embedding for search queries than for documents. This asymmetry improves retrieval accuracy.
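To build intuition for the scores the search sorts by, here is cosine_similarity applied to toy 2-D vectors. The numbers are made up for illustration; real embeddings have hundreds of dimensions:

```python
import numpy as np


def cosine_similarity(vec_a, vec_b):
    """Return the cosine similarity between two vectors."""
    dot = np.dot(vec_a, vec_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return dot / norm


a = np.array([1.0, 0.0])
print(cosine_similarity(a, np.array([2.0, 0.0])))   # 1.0, same direction
print(cosine_similarity(a, np.array([0.0, 3.0])))   # 0.0, orthogonal
print(cosine_similarity(a, np.array([-1.0, 0.0])))  # -1.0, opposite
```

Note that magnitude cancels out of the formula: [2.0, 0.0] scores the same as [1.0, 0.0]. Only direction, which encodes meaning, matters.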
When retrieval fails
Vector search works well for meaning-based questions like "What is the refund policy?" But it has a blind spot: exact matches.
Suppose your PDF contains an invoice with the line "Invoice #4521 — Total: $2,340.00" and you ask:
"What is the total for invoice #4521?"
Cosine similarity compares meaning, not keywords. The embedding for your question is close to any chunk that discusses invoices and totals — not necessarily the chunk that contains #4521. If multiple invoices exist in the document, the search might return the wrong one.
This happens because embeddings compress text into a fixed-size vector. Specific identifiers like invoice numbers, dates, and product codes lose their distinctiveness.
You will hit the same problem with:
- Names and IDs — "What did John Smith order?" might match any chunk about orders
- Exact figures — "Which quarter had $4.2M revenue?" might match any chunk about revenue
- Code references — "What does function parse_config do?" might match any chunk about parsing
The fix is hybrid search: combine vector similarity with keyword matching (BM25). The keyword search catches exact terms that embeddings miss. Production RAG systems use both and merge the results. This tutorial keeps things simple with vector-only search, but understanding this limitation helps you debug retrieval problems later.
Step 5: Build the prompt and generate an answer
You now have the relevant chunks. The final step is to build a prompt that includes these chunks as context and ask Gemini to answer the question.
```python
def build_prompt(question, context_chunks):
    """Assemble a RAG prompt from the question and retrieved chunks."""
    context = "\n\n".join(context_chunks)
    prompt = (
        "You are a helpful assistant. Answer the question using only the "
        "context below.\n"
        'If the answer is not in the context, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )
    return prompt
```

The instruction "Answer the question using only the context below" is what grounds the model. Without it, the model might ignore your chunks and generate an answer from its training data.
The "I don't know" instruction prevents hallucination. If the relevant information is not in the retrieved chunks, the model should admit it rather than guess.
Now generate the answer:
```python
def generate_answer(client, prompt):
    """Send the prompt to Gemini and return the generated answer."""
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
    )
    return response.text
```

Step 6: Add caching
Embedding a large PDF takes time and API calls. You do not want to re-embed the same document every time you ask a question. A simple JSON cache solves this:
```python
import json


def save_embeddings(chunks, embeddings, cache_path):
    """Save chunks and embeddings to a JSON file."""
    data = {"chunks": chunks, "embeddings": embeddings}
    with open(cache_path, "w") as f:
        json.dump(data, f)


def load_embeddings(cache_path):
    """Load chunks and embeddings from a JSON file."""
    if not os.path.exists(cache_path):
        return None
    with open(cache_path) as f:
        data = json.load(f)
    return data["chunks"], data["embeddings"]
```

Step 7: Wire it all together
The main() function ties every piece together:
```python
import sys


def main():
    pdf_path = sys.argv[1]
    question = sys.argv[2]
    cache_path = pdf_path + ".cache.json"

    client = create_client()
    print(f"Loading {pdf_path}...")

    cached = load_embeddings(cache_path)
    if cached:
        chunks, embeddings = cached
        print(f"Loaded cache from {cache_path}")
    else:
        text = extract_text(pdf_path)
        chunks = chunk_text(text)
        print(f"No cache found. Embedding {len(chunks)} chunks...")
        embeddings = embed_all_chunks(client, chunks)
        save_embeddings(chunks, embeddings, cache_path)
        print(f"Cache saved to {cache_path}")

    top_chunks = search(client, question, chunks, embeddings)
    prompt = build_prompt(question, top_chunks)
    answer = generate_answer(client, prompt)

    print("\nAnswer:")
    print(answer)


if __name__ == "__main__":
    main()
```

The flow:
- Read the PDF path and question from command-line arguments
- Check for a cached embedding file
- If no cache exists, extract text, chunk it, embed it, and save the cache
- Search for the most relevant chunks
- Build a prompt with those chunks as context
- Generate and print the answer
Run it
Grab any PDF and try it:
```shell
python app.py report.pdf "What were the key findings?"
```

First run takes longer because it embeds every chunk. Subsequent questions about the same PDF use the cache and answer in seconds.
The full code
Here is the complete app.py — about 100 lines of actual logic:
import os
import sys
import json
import time
import pypdf
import numpy as np
from dotenv import load_dotenv
from google import genai
from google.genai import types
def extract_text(pdf_path):
"""Extract all text from a PDF and return it as a single string."""
reader = pypdf.PdfReader(pdf_path)
pages = [page.extract_text() for page in reader.pages]
return "\n".join(pages)
def chunk_text(text, chunk_size=500, overlap=100):
"""Split text into overlapping chunks of fixed size."""
chunks = []
for i in range(0, len(text), chunk_size - overlap):
chunks.append(text[i : i + chunk_size])
return chunks
def create_client():
"""Load the API key from the environment and return a Gemini client."""
load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=api_key)
return client
def embed_text(client, text):
"""Embed a single text string and return its vector."""
result = client.models.embed_content(
model="gemini-embedding-001",
contents=text,
config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
)
return result.embeddings[0].values
def embed_all_chunks(client, chunks):
"""Embed every chunk in batches to respect the free tier rate limit."""
BATCH_SIZE = 90
embeddings = []
for i in range(0, len(chunks), BATCH_SIZE):
batch = chunks[i : i + BATCH_SIZE]
for chunk in batch:
embeddings.append(embed_text(client, chunk))
if i + BATCH_SIZE < len(chunks):
print("Rate limit pause — waiting 60 seconds...")
time.sleep(60)
return embeddings
def cosine_similarity(vec_a, vec_b):
"""Return the cosine similarity between two vectors."""
dot = np.dot(vec_a, vec_b)
norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
return dot / norm
def search(client, query, chunks, embeddings, top_k=3):
"""Return the top_k most relevant chunks for the given query."""
result = client.models.embed_content(
model="gemini-embedding-001",
contents=query,
config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"),
)
query_vector = result.embeddings[0].values
scores = [
(cosine_similarity(query_vector, emb), chunk)
for emb, chunk in zip(embeddings, chunks)
]
scores.sort(key=lambda x: x[0], reverse=True)
return [chunk for _, chunk in scores[:top_k]]
def build_prompt(question, context_chunks):
"""Assemble a RAG prompt from the question and retrieved chunks."""
context = "\n\n".join(context_chunks)
prompt = (
"You are a helpful assistant. Answer the question using only the "
"context below.\n"
'If the answer is not in the context, say "I don\'t know."\n\n'
f"Context:\n{context}\n\n"
f"Question:\n{question}"
)
return prompt
def generate_answer(client, prompt):
"""Send the prompt to Gemini and return the generated answer."""
response = client.models.generate_content(
model="gemini-2.5-flash",
contents=prompt,
)
return response.text
def save_embeddings(chunks, embeddings, cache_path):
"""Save chunks and embeddings to a JSON file."""
data = {"chunks": chunks, "embeddings": embeddings}
with open(cache_path, "w") as f:
json.dump(data, f)
def load_embeddings(cache_path):
"""Load chunks and embeddings from a JSON file."""
if not os.path.exists(cache_path):
return None
with open(cache_path) as f:
data = json.load(f)
return data["chunks"], data["embeddings"]
def main():
pdf_path = sys.argv[1]
question = sys.argv[2]
cache_path = pdf_path + ".cache.json"
client = create_client()
print(f"Loading {pdf_path}...")
cached = load_embeddings(cache_path)
if cached:
chunks, embeddings = cached
print(f"Loaded cache from {cache_path}")
else:
text = extract_text(pdf_path)
chunks = chunk_text(text)
print(f"No cache found. Embedding {len(chunks)} chunks...")
embeddings = embed_all_chunks(client, chunks)
save_embeddings(chunks, embeddings, cache_path)
print(f"Cache saved to {cache_path}")
top_chunks = search(client, question, chunks, embeddings)
prompt = build_prompt(question, top_chunks)
answer = generate_answer(client, prompt)
print("\nAnswer:")
print(answer)
if __name__ == "__main__":
main()Beyond vector search
The app you just built uses vector-only retrieval. That works for most questions, but as the "When retrieval fails" section showed, it struggles with exact keyword matches.
Production RAG systems solve this with hybrid search — running two searches in parallel and merging the results:
- Vector search finds chunks that are semantically similar to the question. It handles paraphrasing, synonyms, and abstract queries well.
- Keyword search (BM25) finds chunks that contain the exact words from the question. It handles names, IDs, codes, and specific figures well.
A reranker then scores the combined results. Cross-encoder models like ms-marco-MiniLM read the full question and each chunk together, producing a more accurate relevance score than either search method alone. Recent benchmarks show hybrid search with reranking improves accuracy by 30-40% over vector-only retrieval.
You do not need a vector database to add hybrid search. Python's rank_bm25 library handles keyword scoring in about 10 lines of code. Combined with the cosine similarity search you already built, you get a production-grade retrieval pipeline without any framework.
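To make the merge step concrete without adding a dependency, here is a minimal sketch. A toy keyword scorer stands in for BM25 (rank_bm25's BM25Okapi would do this properly), and a weighted sum combines it with hypothetical cosine scores. The chunks and vector scores are invented for illustration; in practice you would plug in your real search results:

```python
def keyword_score(query, chunk):
    """Toy keyword scorer: fraction of query terms present in the chunk.
    A crude stand-in for BM25, which also weights term rarity and length."""
    query_terms = query.lower().split()
    chunk_terms = set(chunk.lower().split())
    hits = sum(1 for term in query_terms if term in chunk_terms)
    return hits / len(query_terms)


def hybrid_rank(query, chunks, vector_scores, alpha=0.5):
    """Merge vector and keyword scores with a weighted sum, best first."""
    combined = [
        (alpha * vec + (1 - alpha) * keyword_score(query, chunk), chunk)
        for vec, chunk in zip(vector_scores, chunks)
    ]
    combined.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in combined]


chunks = [
    "Invoice #4521 — Total: $2,340.00",
    "Invoices are sent monthly and totals include tax.",
]
# Hypothetical cosine scores: the generic chunk happens to score higher,
# the exact-match failure mode described earlier.
vector_scores = [0.70, 0.80]

# The keyword term "#4521" pulls the right chunk back to the top.
print(hybrid_rank("total for invoice #4521", chunks, vector_scores)[0])
```

The weight alpha controls the balance: 1.0 is pure vector search, 0.0 is pure keyword search. Production systems often use reciprocal rank fusion instead of a weighted sum, since it avoids comparing scores on different scales.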
Here are two more improvements worth exploring:
- Smarter chunking — split on paragraph boundaries instead of fixed character counts. This avoids cutting sentences in half.
- Multiple PDFs — load several documents into the same embedding cache and search across all of them.
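The first idea can be sketched in a few lines. This hypothetical chunk_by_paragraphs greedily packs whole paragraphs into chunks up to a size limit; a real splitter would also break up paragraphs that exceed the limit, falling back to sentence boundaries:

```python
def chunk_by_paragraphs(text, max_chars=500):
    """Greedily pack whole paragraphs into chunks of up to max_chars.
    A paragraph longer than max_chars becomes its own oversized chunk;
    a real splitter would fall back to sentence-level splitting there."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph about refunds.\n\n"
    "Second paragraph about shipping.\n\n" + "A" * 600
)
chunks = chunk_by_paragraphs(text)
print(len(chunks))  # 2: the two short paragraphs packed, the long one alone
```

Because no paragraph is ever cut mid-sentence, each chunk reads as coherent text, which tends to produce cleaner embeddings than fixed-size slicing.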
Go deeper with the interactive course
This tutorial gives you the code. The Build a RAG App — Chat with Your PDFs interactive course gives you the full learning experience: step-by-step instructions, an in-browser code editor, and instant validation as you build each function.
You write the code yourself, one function at a time. The course checks your work after every step. Five lessons, 26 chapters, about 3 hours from start to finish.

CTO & Co-founder of DevGuild. Making coding education feel like a game, not a lecture. Shipping features at an unreasonable pace.
@RealDevGuild