Parse Command-Line Arguments
Read the PDF path and question from the terminal using sys.argv
From functions to a working app
You now have every function the RAG pipeline needs: extract, chunk, embed, search, prompt, generate, print, and cache. But they are loose pieces — nothing calls them in order. The next three chapters wire them together into a main() function you can run from the terminal.
After these chapters, you will run:
python app.py invoice.pdf "What is the total amount due?"

and get back an answer grounded in the PDF's content. This chapter handles the first step: reading the user's input.
Reading arguments from the command line
Python's sys.argv is a list of strings that holds whatever the user typed after python:
| Index | Value |
|---|---|
| `sys.argv[0]` | `"app.py"` (the script name) |
| `sys.argv[1]` | `"invoice.pdf"` (the PDF path) |
| `sys.argv[2]` | `"What is the total amount due?"` (the question) |
The main() function reads sys.argv[1] and sys.argv[2] to get the PDF path and the question.
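As a minimal sketch of this, here is how `sys.argv` reading might look, pulled out into a helper so it is easy to test. The usage check on missing arguments is an extra safeguard, not part of the chapter's starter code:

```python
import sys

def read_args(argv):
    # argv[0] is the script name; argv[1] and argv[2] are the user's inputs.
    if len(argv) < 3:
        raise SystemExit(f"Usage: python {argv[0]} <pdf_path> <question>")
    return argv[1], argv[2]

pdf_path, question = read_args(["app.py", "invoice.pdf", "What is the total amount due?"])
print(pdf_path)   # invoice.pdf
print(question)   # What is the total amount due?
```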
The if __name__ guard
The line if __name__ == "__main__" tells Python: *run main() only when this file is executed directly*. If another script imports a function from this file (like from app import search), the guard prevents the full pipeline from running automatically.
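A standalone sketch of the guard (not part of app.py): when this file is run directly, `__name__` is `"__main__"` and `main()` fires; when the file is imported as a module, `__name__` is the module's name instead, so nothing runs on import.

```python
def main():
    print("pipeline running")

# Runs only when executed directly, e.g. `python demo.py`.
# `import demo` from another script would skip this call.
if __name__ == "__main__":
    main()
```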
The cache path convention
Each PDF gets its own cache file. The path is the PDF path with `.cache.json` appended:

invoice.pdf → invoice.pdf.cache.json
report.pdf → report.pdf.cache.json

This chapter sets up cache_path as a variable. The next chapter uses it to decide whether to embed or load from disk.
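The convention above is a plain string concatenation. A tiny sketch (the helper name `cache_path_for` is illustrative, not from the starter code):

```python
def cache_path_for(pdf_path):
    # Append the suffix to the full PDF path, so each PDF
    # maps to a unique cache file sitting next to it.
    return pdf_path + ".cache.json"

print(cache_path_for("invoice.pdf"))  # invoice.pdf.cache.json
print(cache_path_for("report.pdf"))   # report.pdf.cache.json
```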
Instructions
Start the main function. The starter code provides the signature and the if __name__ guard.
- Create a variable named `pdf_path`. Assign it `sys.argv[1]`.
- Create a variable named `question`. Assign it `sys.argv[2]`.
- Create a variable named `cache_path`. Assign it `pdf_path + ".cache.json"`.
- Create a variable named `client`. Assign it `create_client()`.
import json
import os
import sys
import time
import numpy as np
import pypdf
from dotenv import load_dotenv
from google import genai
from google.genai import types
def extract_text(pdf_path):
reader = pypdf.PdfReader(pdf_path)
pages = [page.extract_text() for page in reader.pages]
return "\n".join(pages)
def chunk_text(text, chunk_size=500, overlap=100):
chunks = []
for i in range(0, len(text), chunk_size - overlap):
chunks.append(text[i : i + chunk_size])
return chunks
def preview_chunks(chunks):
print(f"Total chunks: {len(chunks)}")
print(f"First chunk:\n{chunks[0]}")
def create_client():
load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=api_key)
return client
def embed_text(client, text):
result = client.models.embed_content(model="gemini-embedding-001", contents=text, config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"))
return result.embeddings[0].values
def embed_all_chunks(client, chunks):
BATCH_SIZE = 90
embeddings = []
for i in range(0, len(chunks), BATCH_SIZE):
batch = chunks[i : i + BATCH_SIZE]
for chunk in batch:
embeddings.append(embed_text(client, chunk))
if i + BATCH_SIZE < len(chunks):
print("Rate limit pause — waiting 60 seconds...")
time.sleep(60)
return embeddings
def cosine_similarity(vec_a, vec_b):
dot = np.dot(vec_a, vec_b)
norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
return dot / norm
def search(client, query, chunks, embeddings, top_k=3):
result = client.models.embed_content(model="gemini-embedding-001", contents=query, config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"))
query_vector = result.embeddings[0].values
scores = [(cosine_similarity(query_vector, emb), chunk) for emb, chunk in zip(embeddings, chunks)]
scores.sort(key=lambda x: x[0], reverse=True)
return [chunk for _, chunk in scores[:top_k]]
def test_search(client, pdf_path, question):
text = extract_text(pdf_path)
chunks = chunk_text(text)
embeddings = embed_all_chunks(client, chunks)
results = search(client, question, chunks, embeddings)
for i, chunk in enumerate(results, 1):
print(f"Result {i}:\n{chunk}\n")
def build_prompt(question, context_chunks):
context = "\n\n".join(context_chunks)
prompt = f"You are a helpful assistant. Answer the question using only the context below.\nIf the answer is not in the context, say \"I don't know.\"\n\nContext:\n{context}\n\nQuestion:\n{question}"
return prompt
def generate_answer(client, prompt):
response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
return response.text
def print_result(answer, source_chunks, show_sources=True):
print("Answer:")
print(answer)
if show_sources:
print("\nSources:")
for i, chunk in enumerate(source_chunks, 1):
print(f"Source {i}:\n{chunk}\n")
def save_embeddings(chunks, embeddings, cache_path):
data = {"chunks": chunks, "embeddings": embeddings}
with open(cache_path, "w") as f:
json.dump(data, f)
def load_embeddings(cache_path):
if not os.path.exists(cache_path):
return None
with open(cache_path) as f:
data = json.load(f)
return data["chunks"], data["embeddings"]
def main():
# Step 1: Get pdf_path from sys.argv[1]
# Step 2: Get question from sys.argv[2]
# Step 3: Create cache_path (pdf_path + ".cache.json")
# Step 4: Create client
if __name__ == "__main__":
main()
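For reference, one possible shape of the completed `main()` is sketched below. Two deliberate departures from the starter code keep this snippet self-contained: `create_client` is stubbed (in app.py the real one defined earlier is used), and `main` takes `argv` as a parameter instead of reading `sys.argv` directly, which makes it easy to exercise without the terminal.

```python
def create_client():
    # Stand-in for the real create_client() in app.py,
    # which builds a genai.Client from GEMINI_API_KEY.
    return object()

def main(argv):
    pdf_path = argv[1]                      # Step 1: PDF path
    question = argv[2]                      # Step 2: question
    cache_path = pdf_path + ".cache.json"   # Step 3: cache file path
    client = create_client()                # Step 4: API client
    return pdf_path, question, cache_path, client

result = main(["app.py", "invoice.pdf", "What is the total amount due?"])
print(result[2])  # invoice.pdf.cache.json
```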