Parse Command-Line Arguments
Read the PDF path and question from the terminal using sys.argv
From functions to a working app
You now have every function the RAG pipeline needs: extract, chunk, embed, search, prompt, generate, print, and cache. But they are loose pieces — nothing calls them in order. The next three chapters wire them together into a main() function you can run from the terminal.
After these chapters, you will run:
python app.py invoice.pdf "What is the total amount due?"

and get back an answer grounded in the PDF's content. This chapter handles the first step: reading the user's input.
Reading arguments from the command line
Python's sys.argv is a list of strings that holds whatever the user typed after python:
| Index | Value |
|---|---|
| `sys.argv[0]` | `"app.py"` (the script name) |
| `sys.argv[1]` | `"invoice.pdf"` (the PDF path) |
| `sys.argv[2]` | `"What is the total amount due?"` (the question) |
The main() function reads sys.argv[1] and sys.argv[2] to get the PDF path and the question.
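As a minimal sketch of this, here is how `sys.argv` reading might look, pulled out into a helper so it is easy to test. The usage check on missing arguments is an extra safeguard, not part of the chapter's starter code:

```python
import sys

def read_args(argv):
    # argv[0] is the script name; argv[1] and argv[2] are the user's inputs.
    if len(argv) < 3:
        raise SystemExit(f"Usage: python {argv[0]} <pdf_path> <question>")
    return argv[1], argv[2]

pdf_path, question = read_args(["app.py", "invoice.pdf", "What is the total amount due?"])
print(pdf_path)   # invoice.pdf
print(question)   # What is the total amount due?
```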
The if __name__ guard
The line if __name__ == "__main__" tells Python: *run main() only when this file is executed directly*. If another script imports a function from this file (like from app import search), the guard prevents the full pipeline from running automatically.
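A standalone sketch of the guard (not part of app.py): when this file is run directly, `__name__` is `"__main__"` and `main()` fires; when the file is imported as a module, `__name__` is the module's name instead, so nothing runs on import.

```python
def main():
    print("pipeline running")

# Runs only when executed directly, e.g. `python demo.py`.
# `import demo` from another script would skip this call.
if __name__ == "__main__":
    main()
```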
The cache path convention
Each PDF gets its own cache file. The path is the PDF path with `.cache.json` appended:

invoice.pdf → invoice.pdf.cache.json
report.pdf → report.pdf.cache.json

This chapter sets up cache_path as a variable. The next chapter uses it to decide whether to embed or load from disk.
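The convention above is a plain string concatenation. A tiny sketch (the helper name `cache_path_for` is illustrative, not from the starter code):

```python
def cache_path_for(pdf_path):
    # Append the suffix to the full PDF path, so each PDF
    # maps to a unique cache file sitting next to it.
    return pdf_path + ".cache.json"

print(cache_path_for("invoice.pdf"))  # invoice.pdf.cache.json
print(cache_path_for("report.pdf"))   # report.pdf.cache.json
```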
Instructions
Start the main function. The starter code provides the signature and the if __name__ guard.
- Create a variable named `pdf_path`. Assign it `sys.argv[1]`.
- Create a variable named `question`. Assign it `sys.argv[2]`.
- Create a variable named `cache_path`. Assign it `pdf_path + ".cache.json"`.
- Create a variable named `client`. Assign it `create_client()`.
import json
import os
import sys
import time
import numpy as np
import pypdf
from dotenv import load_dotenv
from google import genai
from google.genai import types
def extract_text(pdf_path):
reader = pypdf.PdfReader(pdf_path)
pages = [page.extract_text() for page in reader.pages]
return "\n".join(pages)
def chunk_text(text, chunk_size=500, overlap=100):
chunks = []
for i in range(0, len(text), chunk_size - overlap):
chunks.append(text[i : i + chunk_size])
return chunks
def preview_chunks(chunks):
print(f"Total chunks: {len(chunks)}")
print(f"First chunk:\n{chunks[0]}")
def create_client():
load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=api_key)
return client
def embed_text(client, text):
result = client.models.embed_content(model="gemini-embedding-001", contents=text, config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"))
return result.embeddings[0].values
def embed_all_chunks(client, chunks):
BATCH_SIZE = 90
embeddings = []
for i in range(0, len(chunks), BATCH_SIZE):
batch = chunks[i : i + BATCH_SIZE]
for chunk in batch:
embeddings.append(embed_text(client, chunk))
if i + BATCH_SIZE < len(chunks):
print("Rate limit pause — waiting 60 seconds...")
time.sleep(60)
return embeddings
def cosine_similarity(vec_a, vec_b):
dot = np.dot(vec_a, vec_b)
norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
return dot / norm
def search(client, query, chunks, embeddings, top_k=3):
result = client.models.embed_content(model="gemini-embedding-001", contents=query, config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"))
query_vector = result.embeddings[0].values
scores = [(cosine_similarity(query_vector, emb), chunk) for emb, chunk in zip(embeddings, chunks)]
scores.sort(key=lambda x: x[0], reverse=True)
return [chunk for _, chunk in scores[:top_k]]
def test_search(client, pdf_path, question):
text = extract_text(pdf_path)
chunks = chunk_text(text)
embeddings = embed_all_chunks(client, chunks)
results = search(client, question, chunks, embeddings)
for i, chunk in enumerate(results, 1):
print(f"Result {i}:\n{chunk}\n")
def build_prompt(question, context_chunks):
context = "\n\n".join(context_chunks)
prompt = f"You are a helpful assistant. Answer the question using only the context below.\nIf the answer is not in the context, say \"I don't know.\"\n\nContext:\n{context}\n\nQuestion:\n{question}"
return prompt
def generate_answer(client, prompt):
response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
return response.text
def print_result(answer, source_chunks, show_sources=True):
print("Answer:")
print(answer)
if show_sources:
print("\nSources:")
for i, chunk in enumerate(source_chunks, 1):
print(f"Source {i}:\n{chunk}\n")
def save_embeddings(chunks, embeddings, cache_path):
data = {"chunks": chunks, "embeddings": embeddings}
with open(cache_path, "w") as f:
json.dump(data, f)
def load_embeddings(cache_path):
if not os.path.exists(cache_path):
return None
with open(cache_path) as f:
data = json.load(f)
return data["chunks"], data["embeddings"]
def main():
# Step 1: Get pdf_path from sys.argv[1]
# Step 2: Get question from sys.argv[2]
# Step 3: Create cache_path (pdf_path + ".cache.json")
# Step 4: Create client
if __name__ == "__main__":
main()
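For reference, one possible shape of the completed `main()` is sketched below. Two deliberate departures from the starter code keep this snippet self-contained: `create_client` is stubbed (in app.py the real one defined earlier is used), and `main` takes `argv` as a parameter instead of reading `sys.argv` directly, which makes it easy to exercise without the terminal.

```python
def create_client():
    # Stand-in for the real create_client() in app.py,
    # which builds a genai.Client from GEMINI_API_KEY.
    return object()

def main(argv):
    pdf_path = argv[1]                      # Step 1: PDF path
    question = argv[2]                      # Step 2: question
    cache_path = pdf_path + ".cache.json"   # Step 3: cache file path
    client = create_client()                # Step 4: API client
    return pdf_path, question, cache_path, client

result = main(["app.py", "invoice.pdf", "What is the total amount due?"])
print(result[2])  # invoice.pdf.cache.json
```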